
February 12, 2026 · Variant Systems

Due Diligence for AI-Generated Codebases

How to evaluate AI-built and vibe-coded startups. What traditional due diligence misses and what your deal team needs to check.



Your deal team just finished reviewing a promising SaaS acquisition target. Revenue is growing. Churn is low. The product demo is clean. Technical due diligence came back with decent test coverage, no critical CVEs, and a modern tech stack.

You close the deal. Three months later, your integration team tells you the codebase needs six months of restructuring before it can support the roadmap. The code looks professional on the surface but falls apart under real engineering scrutiny. Nobody can explain why the architecture is the way it is — because nobody designed it. An AI did.

This is happening right now, across the venture and PE landscape. And most deal teams aren’t equipped to catch it.

The New Reality of AI-Built Software

The numbers tell the story. 41% of all code on GitHub is now AI-generated, according to Copilot’s own data. A quarter of Y Combinator’s Winter 2025 batch was built with 95% or more AI-generated code. This isn’t a fringe movement.

The market has validated it. Wix acquired Base44 — a platform for building apps entirely through prompts — signaling that vibe-coded products are acquisition targets, not just prototypes. Lovable, an AI app builder, hit a $1.8 billion valuation. Cursor became the default editor for a generation of developers who treat AI as a co-pilot rather than a tool.

“Vibe coding” went from a meme coined by Andrej Karpathy to a legitimate development methodology in under a year. Founders aren’t embarrassed to say they built with AI anymore. They put it in their pitch decks.

Here’s the investment implication: the startups in your pipeline increasingly run on code that no human fully wrote, fully reviewed, or fully understands. Traditional technical due diligence — the kind that checks for test coverage, architecture patterns, and security vulnerabilities — doesn’t catch the problems this creates.

The code passes the surface-level checks. It fails under the kind of scrutiny that matters for a five-year hold or a platform integration.

Why Traditional Due Diligence Misses AI Code Problems

Traditional tech DD was designed for a world where humans wrote code deliberately. An engineer chose an architecture. A team debated trade-offs. Design decisions were recorded in ADRs or, at minimum, in commit messages. When you reviewed the codebase, you could trace the reasoning.

AI-generated code doesn’t work this way. It looks clean. The naming conventions are consistent. The formatting is proper. The tests pass. A cursory review suggests a well-maintained codebase.

But the institutional knowledge is missing. There are no design decisions because there were no decisions — just prompts. There’s no record of trade-offs because nobody weighed alternatives. The code exists because an AI predicted it was the most likely next token, not because an engineer determined it was the right approach.

A CodeRabbit study found that AI-authored pull requests have 1.7x more issues than human-written code. Not because the syntax is wrong. Because the judgment is missing.

Traditional DD checks for things like: Does the codebase have tests? Is there a CI/CD pipeline? Are dependencies up to date? Are there known security vulnerabilities? AI-generated codebases can pass every one of these checks while harboring deep structural problems that will cost you millions to fix post-acquisition.

The “works in the demo” problem is the most dangerous pattern. AI code is optimized for the happy path. The demo works flawlessly. The first 100 users have a great experience. But edge cases — network failures, concurrent users, malformed data, timezone boundaries, partial state — are where AI code falls apart. And edge cases are where production software lives.
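To make that concrete, here is a minimal sketch of the pattern in TypeScript. The endpoint and field names are invented, not taken from any real target; the point is the shape of demo-grade code.

```typescript
// Hypothetical demo-grade helper: fine on the happy path, fragile at the edges.
async function getRenewalDate(customerId: string): Promise<string> {
  const res = await fetch(`/api/customers/${customerId}`); // no timeout, no retry on network failure
  const customer = await res.json();                       // res.ok is never checked; error responses flow straight into parsing
  const renewal = new Date(customer.plan.renewedAt);       // throws if plan is missing (partial state)
  renewal.setMonth(renewal.getMonth() + 1);                // Jan 31 "+1 month" rolls over into early March
  return renewal.toISOString().split("T")[0];              // UTC conversion shifts the date across timezone boundaries
}
```

Every line works in the demo. Every comment describes a production incident waiting to happen.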

There’s also the commit history problem. In a human-built codebase, you can read the git log and understand how the architecture evolved. Why was the payment module restructured in Q3? The commits tell you. In a vibe-coded codebase, the commit history is often “generated feature X” or “fixed bug Y” with no reasoning trail. You can’t trace design decisions because there weren’t any deliberate ones.

What to Look For in AI-Generated Codebases

Not all AI-generated code is the same. The tool matters. Each one produces distinct patterns, and knowing what to look for by tool gives your technical reviewers a significant advantage.

Cursor-generated code tends to be the highest quality of the bunch because a human developer is in the loop. But the quality depends entirely on that developer. The signature pattern is inconsistency across the codebase, because each prompt had different context. Look for: repeated utility functions that do the same thing slightly differently, inconsistent error handling strategies (some files use try-catch, others use Result types, others silently swallow errors), and mixed architectural patterns where some files follow one approach and others follow another. We’ve written in detail about what goes wrong in Cursor-built codebases, and the patterns are predictable.
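Here is a condensed, hypothetical illustration of that inconsistency: three files from the same imaginary codebase, sketched in one TypeScript snippet with invented module stubs.

```typescript
// Stand-ins for the app's real modules (hypothetical).
type Receipt = { id: string; amountCents: number };
declare const paymentApi: { charge(amountCents: number): Promise<{ ok: boolean; status: number; data: Receipt }> };
declare const invoiceApi: { send(id: string): Promise<void> };
declare const mailer: { send(to: string): Promise<void> };

// billing/charge.ts - throws and expects the caller to handle it
export async function chargeCard(amountCents: number): Promise<Receipt> {
  const res = await paymentApi.charge(amountCents);
  if (!res.ok) throw new Error(`charge failed: ${res.status}`);
  return res.data;
}

// invoices/send.ts - same idea, wrapped in an ad hoc Result type
export async function sendInvoice(id: string): Promise<{ ok: boolean; error?: string }> {
  try {
    await invoiceApi.send(id);
    return { ok: true };
  } catch (e) {
    return { ok: false, error: String(e) };
  }
}

// emails/notify.ts - same idea again, but failures simply disappear
export async function notifyCustomer(to: string): Promise<void> {
  try {
    await mailer.send(to);
  } catch {
    // swallowed: no log, no retry, no signal to the caller
  }
}
```

None of these is wrong in isolation. The cost shows up when a new engineer has to reason about three different error contracts at once.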

Bolt.new code carries a fundamental constraint: it’s browser-based generation with no local development tooling. This means no proper testing infrastructure, no CI/CD, and no environment configuration. Look for: everything crammed into one directory, minimal separation of concerns, hardcoded configuration values, and no deployment pipeline. The code was built to run inside Bolt’s preview pane, not on production infrastructure. We’ve documented the specific scaling problems these codebases hit.
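A typical finding looks something like this. The values are invented, and the remediated version assumes Node-style environment variables; it is a sketch of the pattern, not any specific codebase.

```typescript
// What we usually find: configuration baked into the one file that runs inside Bolt's preview.
const SUPABASE_URL = "https://abc123.supabase.co"; // project URL committed to source
const SUPABASE_ANON_KEY = "eyJhbGciOi...";          // key committed to source
const CHECKOUT_PRICE_ID = "price_basic_monthly";    // billing tied to a single environment

// What production infrastructure needs instead: environment-driven configuration.
export const config = {
  supabaseUrl: process.env.SUPABASE_URL ?? "",
  supabaseAnonKey: process.env.SUPABASE_ANON_KEY ?? "",
  checkoutPriceId: process.env.CHECKOUT_PRICE_ID ?? "",
};
```

The fix is mechanical. The cost is that it has to happen before the app can even have a staging environment.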

Lovable generates full-stack applications, which means business logic gets mixed with UI code from day one. Look for: fat components that handle data fetching, state management, and rendering all in one file. Duplicated API calls across multiple components instead of a shared service layer. Hard-coded values everywhere. No abstraction between the frontend and the database. The patterns are consistent across every Lovable-built app we’ve reviewed.
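A condensed, hypothetical version of the fat-component pattern in TypeScript/React. The endpoint, key, and discount rule are invented for illustration.

```tsx
// Hypothetical fat component: data access, business rules, and rendering in one file.
import { useEffect, useState } from "react";

type Order = { id: string; total: number; status: string };

export function OrdersPage() {
  const [orders, setOrders] = useState<Order[]>([]);

  useEffect(() => {
    // Direct fetch in the component: no shared service layer, duplicated in other pages.
    fetch("https://xyz.supabase.co/rest/v1/orders?select=*", {
      headers: { apikey: "hardcoded-anon-key" },
    })
      .then((r) => r.json())
      .then(setOrders);
  }, []);

  // Business rule lives in the render path instead of a shared module.
  const discounted = orders.map((o) => ({ ...o, total: o.total > 100 ? o.total * 0.9 : o.total }));

  return (
    <ul>
      {discounted.map((o) => (
        <li key={o.id}>{o.id}: ${o.total.toFixed(2)} ({o.status})</li>
      ))}
    </ul>
  );
}
```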

Copilot-assisted code is the hardest to evaluate because it’s mixed in with human-written code. There’s no clear boundary between what the developer wrote and what Copilot suggested. The quality depends on whether the developer reviewed each suggestion or just hit Tab. Look for: code that is syntactically correct but semantically wrong — functions that handle the happy path perfectly and fail on the first edge case. We’ve covered what hides in Copilot-heavy codebases in depth.
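A small, hypothetical example of what “syntactically correct but semantically wrong” looks like in practice:

```typescript
// Compiles cleanly, reads plausibly, and handles the demo case correctly.
function proratedRefundCents(amountCents: number, daysUsed: number, daysInPeriod: number): number {
  return amountCents * ((daysInPeriod - daysUsed) / daysInPeriod);
}
// Where it fails: a daysInPeriod of 0 returns NaN or Infinity, daysUsed greater than
// daysInPeriod produces a negative refund, and the result is a float rather than a
// whole number of cents - nothing here rounds, validates, or clamps.
```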

General AI patterns your deal team should flag:

  • Test coverage that looks good on paper but whose tests don’t assert meaningful behavior. AI is great at writing tests that pass. It’s bad at writing tests that catch real bugs.
  • Error handling that catches everything and does nothing. Silent exception swallowing is an AI signature; both this pattern and the meaningless-test pattern above are sketched after this list. The app doesn’t crash, but it also doesn’t handle failures.
  • API integrations that work against current library versions but use deprecated or fragile patterns that will break on the next update.
  • No defensive coding for network failures, race conditions, or partial data. The code assumes the world is always cooperative.
  • Duplicated logic across files instead of shared abstractions. AI generates what you ask for in isolation, so the same business rule gets implemented six different ways.
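Two of those flags, condensed into hypothetical TypeScript. Jest-style test syntax is assumed; the declared functions are stand-ins, not a real codebase.

```typescript
// Stand-ins for the app's real functions and for Jest globals (hypothetical).
declare function loadPendingItems(): Promise<string[]>;
declare function pushToWarehouse(items: string[]): Promise<void>;
declare function test(name: string, fn: () => Promise<void> | void): void;
declare function expect(value: unknown): { toBe(expected: unknown): void };

// Flag: error handling that catches everything and does nothing.
export async function syncInventory(): Promise<void> {
  try {
    await pushToWarehouse(await loadPendingItems());
  } catch (err) {
    console.log(err); // logged and forgotten: no retry, no alert, no failure state
  }
}

// Flag: a test that passes without asserting meaningful behavior.
test("syncInventory works", async () => {
  await syncInventory();   // nothing is mocked, nothing is inspected
  expect(true).toBe(true); // a green checkmark that carries zero information
});
```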

The IP and Licensing Question

This is where deal teams need to pay the most attention, and where they currently have the least playbook.

Who owns AI-generated code? The honest answer: it’s not fully settled. The legal landscape is evolving fast, and different tools have different terms.

GitHub Copilot was trained on public repositories, including those with GPL, AGPL, and other copyleft licenses. It can and does reproduce code patterns from those repositories. If a meaningful portion of the target’s codebase contains Copilot suggestions derived from copyleft code, you may be acquiring a licensing liability. GitHub’s terms give the user rights to the output, but that doesn’t resolve the upstream training data question.

Other tools have different terms. Cursor uses various underlying models. Lovable and Bolt.new generate code through their own pipelines. Each has different IP assignment language in their terms of service. Your legal team needs to map which tools were used and review each tool’s terms for IP transferability.

The practical questions for the data room:

  • What AI tools were used in development, and what percentage of the codebase did each produce?
  • Are there any code blocks that the team knows were directly generated without modification?
  • Has anyone run license scanning on the codebase for copyleft contamination?
  • Do the AI tool terms of service allow for IP transfer in an acquisition?
  • Has the company obtained a legal opinion on the ownership status of its AI-generated code?

The practical approach: treat AI code provenance as a risk factor, not a dealbreaker. Most startups using AI tools are in the same boat. But size the risk. If 90% of the codebase is AI-generated and nobody has done license scanning, that’s material. If 20% is Copilot-assisted with an experienced team reviewing every suggestion, the risk is manageable.

Include remediation cost for any license issues in your integration model. Rewriting contaminated code isn’t cheap, and it’s better to price it into the deal than discover it after closing.

Quantifying the Remediation Cost

Here’s where due diligence for AI-generated codebases gets concrete. You need a number. How much will it cost to bring this codebase to production-grade after acquisition?

Based on our code audit experience across dozens of AI-generated codebases, they typically need 30-60% more restructuring than traditional codebases of similar size and age. The range is wide because it depends heavily on whether a competent engineer was in the loop during development.

The remediation work falls into four categories:

Architecture refactoring. AI-built apps often lack proper abstraction layers. Business logic lives in UI components. Data access is scattered across files. There’s no service layer, no clear module boundaries, no separation of concerns. Adding these layers after the fact touches nearly every file in the codebase. This is the most expensive category and the one most likely to be underestimated.

Test rewriting. Not adding test coverage — rewriting what’s there. AI-generated tests look like real tests. They import the right modules, call the right functions, and assert… something. But often they’re testing implementation details instead of behavior. They pass when the code is wrong and break when someone refactors correctly. Meaningful test coverage usually means throwing out the AI-generated tests and starting fresh on the critical paths.
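A hypothetical before-and-after, in Jest-style TypeScript. The cart functions and calculator module are invented stand-ins; the contrast is what matters.

```typescript
// Stand-ins so the sketch type-checks outside a real Jest project (hypothetical).
type Cart = { subtotalCents: number; totalCents: number };
declare function applyDiscount(cart: Cart): Cart;
declare function makeCart(init: { subtotalCents: number }): Cart;
declare const calculator: { multiply(a: number, b: number): number };
declare const jest: { spyOn(obj: object, method: string): unknown };
declare function test(name: string, fn: () => void): void;
declare function expect(value: unknown): { toBe(expected: unknown): void; toHaveBeenCalled(): void };

// What we typically find: a test pinned to implementation details.
test("applyDiscount uses the calculator", () => {
  const spy = jest.spyOn(calculator, "multiply");
  applyDiscount(makeCart({ subtotalCents: 12000 }));
  expect(spy).toHaveBeenCalled(); // still passes if the discount amount is wrong
});

// The replacement: a test pinned to the behavior the business depends on.
test("orders over $100 get 10% off", () => {
  const result = applyDiscount(makeCart({ subtotalCents: 12000 }));
  expect(result.totalCents).toBe(10800); // breaks only when the rule actually breaks
});
```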

Security hardening. Hardcoded secrets, SQL injection vectors, missing input validation, overly permissive CORS configurations, authentication bypasses in edge cases. AI code is particularly weak on security because security requires understanding intent, not just pattern matching. Budget for a thorough security review and remediation pass.
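Two of the most common findings, sketched in TypeScript. An Express-style app and a pg-style query client are assumed; the route, table, and ID format are invented.

```typescript
import express from "express";

// pg-style client, declared as a stand-in (hypothetical).
declare const db: { query(sql: string, params?: unknown[]): Promise<unknown[]> };
const app = express();

// What we typically find: user input concatenated straight into SQL, no validation.
app.get("/api/orders", async (req, res) => {
  const rows = await db.query(
    `SELECT * FROM orders WHERE customer_id = '${req.query.customerId}'` // injectable
  );
  res.json(rows);
});

// The remediated shape: validate the input, parameterize the query.
app.get("/api/orders/v2", async (req, res) => {
  const customerId = String(req.query.customerId ?? "");
  if (!/^[0-9a-fA-F-]{36}$/.test(customerId)) {
    return res.status(400).json({ error: "invalid customer id" });
  }
  const rows = await db.query("SELECT * FROM orders WHERE customer_id = $1", [customerId]);
  res.json(rows);
});
```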

Dependency cleanup. AI tools pull in packages liberally. You’ll find dependencies that are used once, dependencies with known vulnerabilities, dependencies that duplicate functionality already in the codebase, and dependencies that are abandoned or unmaintained. Pruning and replacing these is tedious but necessary.

The “thin layer” problem deserves special attention. AI-built apps tend to have tight coupling between every layer. The UI component calls the database directly. The API route mixes validation, business rules, and persistence in a single handler. There’s no abstraction boundary anywhere. Adding proper layers means touching every file, and the risk of introducing regressions is high without good test coverage (which, as noted, usually doesn’t exist).
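What “adding the missing layer” looks like, in hypothetical TypeScript: the component stops owning data access and business rules, and a small service becomes the only thing that touches the repository.

```typescript
// Stand-ins for the data layer (hypothetical).
type Order = { id: string; totalCents: number };
declare const ordersRepo: { listByCustomer(customerId: string): Promise<Order[]> };

// services/orders.ts - the one place that owns the rule and the data access.
export function applyVolumeDiscount(order: Order): Order {
  return order.totalCents > 10000
    ? { ...order, totalCents: Math.round(order.totalCents * 0.9) }
    : order;
}

export async function getDiscountedOrders(customerId: string): Promise<Order[]> {
  const orders = await ordersRepo.listByCustomer(customerId); // the repository hides the DB client
  return orders.map(applyVolumeDiscount);
}

// The UI component now calls getDiscountedOrders() and renders the result;
// it no longer knows which database exists or what the discount rule is.
```

The expense isn’t writing the service. It’s rerouting every existing call site through it without breaking behavior you can’t yet test.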

Rule of thumb: For a vibe-coded MVP with 50,000 lines of code, budget 2-4 engineering months for production-readiness. For a larger application (100K+ lines) built primarily with AI tools and no senior engineering oversight, you’re looking at 4-8 months. These are vibe code cleanup engagements, not rewrites — the goal is to fix what’s there, not start over.

What to include in your integration cost model:

  • Engineering hours for architecture refactoring (the big one)
  • Security audit and remediation
  • Test suite rewrite for critical paths
  • Dependency audit and cleanup
  • DevOps and deployment pipeline work (many AI-built apps have no proper CI/CD)
  • Ongoing maintenance overhead during the transition (the codebase is harder to work in during restructuring)

How remediation cost should inform your offer price: treat it like deferred capex. If the codebase needs $200K of engineering work to reach production-grade, that’s $200K off the valuation. Don’t absorb it as a post-close surprise. Price it in.

A Due Diligence Framework for AI-Built Startups

Here’s the framework we use when running technical due diligence on AI-generated codebases. Five steps, each building on the last.

Step 1: Identify the AI footprint.

Before you can assess risk, you need to know what you’re looking at. What percentage of the codebase is AI-generated? Which tools were used? When did AI-assisted development start?

Ask the founding team directly. Most will be transparent about it — it’s increasingly seen as a positive, not a negative. Cross-reference with commit history patterns. AI-generated code often has distinct commit patterns: large commits with many files changed, generic commit messages, no PR review history.

Tools like GitHub’s Copilot metrics (if they use GitHub) can give you quantitative data. For Bolt.new or Lovable-built apps, the answer is usually “nearly everything.”

Step 2: Evaluate the human oversight layer.

This is the most important signal. Was there a senior engineer reviewing AI output, or was it prompt-and-ship?

A codebase where an experienced developer used AI as an accelerator is fundamentally different from one where a non-technical founder used AI as a replacement for engineering. The first has judgment embedded in it. The second has prompts embedded in it.

Look for: code review history in PRs, architectural decision records, refactoring commits that consolidate AI output, custom abstractions that weren’t generated. These are signs of human oversight. Their absence is a red flag.

Step 3: Test the edges.

This is where AI codebases reveal themselves. Don’t test the happy path — the demo already proved that works. Test the unhappy paths.

What happens when the API returns a 500? What happens when a user submits an empty form? What happens with concurrent requests to the same resource? What happens when the database connection drops? What happens with Unicode input, extremely long strings, or negative numbers where positives are expected?
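In practice this means running a handful of unhappy-path checks against the target. A sketch, with Jest-style syntax and invented app functions:

```typescript
// Stand-ins for the app under test and for Jest globals (hypothetical).
declare function createProject(input: { name: string }): Promise<unknown>;
declare function updateSeatCount(projectId: string, seats: number): Promise<unknown>;
declare function test(name: string, fn: () => Promise<void>): void;
declare function expect(value: unknown): {
  rejects: { toThrow(): Promise<void> };
  toContain(item: unknown): void;
};

test("an empty form is rejected, not silently saved", async () => {
  await expect(createProject({ name: "" })).rejects.toThrow();
});

test("concurrent updates to the same record don't silently drop one", async () => {
  const results = await Promise.allSettled([
    updateSeatCount("proj_1", 5),
    updateSeatCount("proj_1", 7),
  ]);
  // With optimistic locking, one of the two writes should be rejected, not lost.
  expect(results.map((r) => r.status)).toContain("rejected");
});
```

Vibe-coded apps usually fail checks like these on the first run. That failure is the data point you’re buying with this step.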

AI code is optimized for the happy path because that’s what gets demonstrated. Edge case handling is where you find the real quality of the engineering.

Step 4: Assess the team’s understanding.

Can the founding team explain why the code is structured the way it is? Not just what it does — why it’s built that way. Can they trace a request through the system? Can they explain the data model choices?

If the answers are “that’s what the AI generated” or “I’m not sure why it does it that way,” you’re looking at a codebase that nobody fully understands. That’s a significant integration risk. The team can’t maintain what they don’t understand, and neither can yours.

Step 5: Size the remediation.

Take the findings from steps 1-4 and put a number on it. What would it cost to bring this codebase to your engineering standards? What’s the timeline? What’s the risk during the transition?

This isn’t hypothetical — it’s a line item in your integration budget. Get a technical team that’s experienced with AI-generated codebases to estimate it. General engineering estimates will be wrong because they won’t account for the AI-specific patterns that inflate restructuring costs.

Making Better Decisions on AI-Built Targets

AI-generated code isn’t inherently bad. We build with AI tools every day. The issue isn’t that AI wrote the code — it’s that traditional due diligence wasn’t designed to evaluate what AI produces.

The startups in your pipeline are increasingly AI-built. That’s not going to change. What needs to change is how you evaluate them. Surface-level technical DD that checks test coverage and dependency versions will miss the structural problems that turn a promising acquisition into an expensive integration project.

The framework above gives your deal team the questions to ask and the patterns to look for. But the technical evaluation itself requires people who’ve been inside these codebases — who know what Cursor-generated code looks like versus Lovable-generated code, and what each pattern costs to fix.

The firms that build this competency into their due diligence process will make better acquisition decisions. The ones that don’t will keep discovering six-figure remediation costs three months after closing.


Need to evaluate an AI-built startup before investing? Variant Systems runs technical due diligence specifically designed for vibe-coded and AI-generated codebases. We know what to look for because we fix these codebases every day.