What No One Is Measuring About Vibe Coding
We track AI adoption rate. We rarely track AI code quality rate.
Across the engineering organizations we work with, the average AI production-merge rate sits around 18%. That means roughly 82% of AI-generated suggestions get rejected, modified, or never make it to production. The number sounds healthy. It might not be. The question nobody is asking is: what is the selection logic that determines which 18% ships?
In some teams, it is rigorous review catching the security gaps and logic errors that AI consistently produces. In others, it is time pressure, rubber-stamping, and the quiet assumption that 'the AI wouldn't generate something broken.'
That assumption has a documented track record. It is not a good one.
The Incidents Are Documented. They Follow a Pattern.
In July 2025, Tea, a women's dating safety app, exposed 72,000 images and 1.1 million private messages via an open Firebase bucket. The database was not hacked. Nobody exploited a sophisticated vulnerability. The Firebase instance had no authorization policies configured. The code was functional. The security was just missing.
In May 2025, CVE-2025-48757 was filed against Lovable, a vibe coding platform, with 170+ production applications affected. The cause: missing Row Level Security on Supabase tables. The result: full database exposure, with user data, authentication info, and business data accessible to anyone with the public key.
In late 2025, Enrichlead shut down entirely after an AI-generated codebase put all security logic on the client side. Not a subtle bug. A structural failure that went undetected until it was too late to fix at reasonable cost.
The pattern across every documented incident is the same: functionally correct code with security layers that were never implemented, because the AI was never prompted to implement them and nobody in the review process noticed they were absent.
The code looks right. It just doesn't have the security controls a senior developer would include by default.
Veracode's 2025 GenAI Code Security Report tested over 100 LLMs across four languages and found that 45% of AI-generated code failed OWASP Top 10 security benchmarks. Apiiro tracked a 322% increase in privilege escalation paths in AI-heavy codebases across Fortune 50 enterprises. By June 2025, AI-generated code was introducing over 10,000 new security findings per month in the organizations Apiiro tracked. That's a 10x increase from December 2024.
The speed gains are real. The security debt is also real.
Why AI Code Fails in Ways Human Code Doesn't
AI models generate code that satisfies the stated requirement. They don't generate code that satisfies the unstated assumptions a senior developer carries into every line they write.
When an experienced engineer builds an authentication endpoint, they apply a dozen implicit rules they have learned from incidents, code reviews, and security training. The AI doesn't have that context. It has patterns from training data, and security-critical patterns are statistically rare in public codebases.
This creates a specific failure profile. CodeRabbit's December 2025 analysis of AI versus human pull requests found that AI-generated code was:
- 2.74x more likely to introduce XSS vulnerabilities
- 1.91x more likely to create insecure direct object references (IDOR)
- 1.88x more likely to mishandle passwords
- 1.82x more likely to implement insecure deserialization
The vulnerability classes that increased most sharply were not the obvious logic errors. Syntax errors in AI-generated code actually dropped 76%. The flaws that grew were the dangerous architectural ones that look correct at the function level but expose the system at the access control layer.
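To make "correct at the function level, exposed at the access control layer" concrete, here is a minimal Python sketch of an IDOR. The `Invoice` type, the in-memory `INVOICES` store, and both function names are hypothetical and invented for illustration; they are not drawn from any of the studies cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, amount_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, amount_cents=9900),
}


def get_invoice_unsafe(invoice_id: int) -> Invoice:
    """Satisfies the stated requirement: return the invoice for an id.

    Any caller who can guess an id can read any invoice (an IDOR).
    """
    return INVOICES[invoice_id]


def get_invoice(invoice_id: int, requester_id: int) -> Invoice:
    """Same lookup, plus the ownership check a reviewer should look for."""
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so ids cannot be probed.
    if invoice is None or invoice.owner_id != requester_id:
        raise PermissionError("invoice not found or not accessible")
    return invoice
```

Both functions pass a test that asks "does fetching invoice 1 return invoice 1?". Only the second one fails safely when user 100 asks for invoice 2, which is why a functional test suite alone tends not to surface this class of flaw.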
Nearly 80% of developers believe AI tools generate more secure code than humans write, according to Snyk's research. The empirical data says the opposite, consistently, across every systematic study reviewed in 2025-2026.
Confidence without calibration is how incidents happen.
The Six Metrics Every Engineering Leader Needs Before AI Code Hits Production
Your existing quality metrics were not designed for AI-generated code. Cycle time looks fine. Velocity looks fine. PR count looks fine. None of these tell you whether the code that shipped was actually reviewed, or whether the reviewer spent four minutes clicking approve because the diff looked clean and the tests passed.
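You need a measurement layer that is specifically designed for the AI code era. Here is what to track:
- AI production-merge rate
- Security finding rate per AI PR vs human PR
- Rework rate on AI-sourced code
- Credential exposure rate
- Post-merge bug escape rate by code origin
- Review time per AI PR vs human PR
A few notes on how to use these metrics in practice.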
The AI production-merge rate is the most counterintuitive of these. A high number is not necessarily good. If your merge rate is climbing from 18% to 35%, the question is why. Is it because code quality improved and reviewers are correctly approving more? Or is it because review standards have quietly dropped as engineers normalized AI-generated output?
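As a rough illustration of what the merge rate measures, the sketch below computes it from a list of suggestion records. The `Suggestion` record and its two fields are assumptions made for this example; real attribution data will depend on how your tooling tags AI-generated code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    ai_generated: bool          # assumed attribution flag
    merged_to_production: bool  # assumed outcome flag


def ai_production_merge_rate(suggestions: list[Suggestion]) -> float:
    """Share of AI-generated suggestions that reached production (0.0 to 1.0)."""
    ai = [s for s in suggestions if s.ai_generated]
    if not ai:
        # No AI-generated suggestions in the sample: report 0.0 rather than divide by zero.
        return 0.0
    return sum(s.merged_to_production for s in ai) / len(ai)
```

With 100 AI-generated suggestions of which 18 were merged, the function returns 0.18. The number alone says nothing about why those 18 shipped, which is why it needs to be read alongside the other metrics.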
One 500-person engineering organization with a $2M annual AI tooling investment found that only 12% of AI-generated code was making it to production with meaningful quality signals attached. The other 88% was being generated, discarded, or silently skipped past review. After six weeks of measurement with Hivel, the team had visibility into exactly where the AI code was going and where the rework was accumulating.
How to Build the Governance Layer Without Killing the Speed
The response to documented AI code security failures should not be to ban vibe coding. The speed gains are real. Teams that have adopted AI coding tools well are shipping 4x more features per sprint cycle than teams that haven't. You don't want to give that back.
The right response is to build a governance layer that is proportional to the risk level of what the AI is generating.
Low-risk AI code: CRUD, UI, data transforms
Standard code review, SAST on every PR, no additional gates. AI can be trusted here with normal oversight. This covers the majority of your codebase.
High-risk AI code: auth, access control, payment logic, secrets management
Mandatory human review before merge, regardless of AI tool confidence score. Threat modeling before any auth change lands in main. Automated checks for headers, CSRF tokens, and OWASP Top 10 basics running in CI. No exceptions for urgency.
The Tenzai analysis of 15 production vibe-coded applications found that every single one was missing CSRF protection and security headers, and every single tool introduced SSRF vulnerabilities. These are not edge cases. They are defaults.
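A header check of the kind described above can be very small. The following Python sketch assumes the CI job has already fetched a response from a staging deployment and holds its headers as a dictionary. The `REQUIRED_HEADERS` list is an illustrative starting point, not a complete policy, and the function name is hypothetical.

```python
# Illustrative baseline; each team should define its own required set.
REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
)


def missing_security_headers(response_headers: dict[str, str]) -> list[str]:
    """Return the required headers absent from a response.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    present = {name.lower() for name in response_headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]
```

A CI step could fail the build whenever the returned list is non-empty. Note that this only checks presence, not whether the header values are sensible, so it complements rather than replaces human review.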
Enforcement, not policy
Policy without enforcement is wishful thinking. Every governance rule needs a corresponding CI gate. If auth code cannot merge without a designated security reviewer, that rule needs to be enforced by branch protection, not by memory.
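One way to picture such a gate is a small check that classifies changed files by path and refuses the merge when high-risk files lack a security approval. This is a sketch under assumptions: the path patterns, the reviewer names in the usage note, and the function names are all hypothetical, and in practice the same rule would usually sit alongside your platform's branch protection rather than replace it.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to the repository's actual layout.
HIGH_RISK_PATTERNS = ("*/auth/*", "*/access_control/*", "*/payments/*", "*/secrets/*")


def is_high_risk(path: str) -> bool:
    """True if the file sits under a directory the team has marked high-risk."""
    # Prefix "/" so that top-level directories such as "auth/" also match "*/auth/*".
    normalized = "/" + path.lstrip("/")
    return any(fnmatchcase(normalized, pattern) for pattern in HIGH_RISK_PATTERNS)


def merge_allowed(
    changed_files: list[str],
    approvers: set[str],
    security_reviewers: set[str],
) -> tuple[bool, list[str]]:
    """Return (allowed, high_risk_files) for a pull request.

    A PR touching high-risk files is allowed only if at least one approver
    is a designated security reviewer.
    """
    risky = [f for f in changed_files if is_high_risk(f)]
    allowed = not risky or bool(approvers & security_reviewers)
    return allowed, risky
```

For example, `merge_allowed(["src/auth/login.py"], {"dev-bob"}, {"sec-alice"})` reports the merge as blocked and lists the auth file, while a PR that only touches `src/ui/button.py` passes with no extra gate.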
SonarQube for code complexity and OWASP patterns. Semgrep for custom rules your team defines around your own security requirements. TruffleHog or GitHub Advanced Security for secret scanning. These run on every PR, automatically, without requiring a human to remember to check.
The teams that are getting AI code right in 2026 are not reviewing AI code more slowly. They are reviewing it more specifically, with tooling that flags the patterns AI consistently gets wrong, so human reviewers can focus on the architectural decisions that tools can't catch.
Frequently Asked Questions
What is vibe coding quality?
Vibe coding quality refers to the security, correctness, and maintainability of code produced through natural language AI prompts, where developers describe intent and accept AI-generated implementation without writing each line themselves. Vibe-coded applications consistently show higher rates of specific vulnerability classes, particularly missing access controls, hardcoded credentials, and insecure direct object references, because AI generates functionally correct code without the implicit security assumptions experienced developers apply by default.
What are the main AI code security risks?
The most consistently documented AI code security risks include: XSS vulnerabilities (2.74x more likely in AI PRs than human PRs), insecure direct object references (1.91x more likely), improper password handling (1.88x more likely), missing security headers and CSRF protection (found in 100% of tested vibe-coded apps in one December 2025 study), and hardcoded credentials. The underlying cause is the same across all categories: AI models optimize for functional output and lack the security context that senior developers internalize from experience.
How do you measure AI code quality in an engineering organization?
Start with attribution: you need to know which code is AI-generated at the PR level before you can measure its quality. Then track six metrics: AI production-merge rate, security finding rate per AI PR vs human PR, rework rate on AI-sourced code, credential exposure rate, post-merge bug escape rate by code origin, and review time per AI PR vs human PR. Rising rework rate combined with rising merge rate is the clearest early warning signal that review standards have dropped.
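The early warning signal described above can be sketched as a comparison between consecutive reporting periods. The `PeriodMetrics` record, its field names, and the simple "both went up" rule are assumptions for illustration; a real implementation would likely add thresholds to filter out noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodMetrics:
    label: str
    merge_rate: float   # AI production-merge rate, 0.0 to 1.0
    rework_rate: float  # share of merged AI-sourced code later reworked, 0.0 to 1.0


def review_drift_warnings(periods: list[PeriodMetrics]) -> list[str]:
    """Labels of periods where merge rate and rework rate both rose
    compared with the immediately preceding period."""
    flagged = []
    for prev, cur in zip(periods, periods[1:]):
        if cur.merge_rate > prev.merge_rate and cur.rework_rate > prev.rework_rate:
            flagged.append(cur.label)
    return flagged
```

A period where the merge rate rises while rework falls is not flagged, since that pattern is consistent with quality genuinely improving. Only the combination of more code shipping and more of it being reworked is treated as a sign that review standards may have slipped.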
What percentage of AI-generated code contains security vulnerabilities?
Research findings range from roughly 40% to 65% depending on methodology and tool tested. Veracode's 2025 GenAI Code Security Report, covering 100+ LLMs across four languages, found a 45% failure rate against the OWASP Top 10. CSA and Endor Labs found that 62% of AI-generated code contained design flaws or known vulnerabilities. Escape.tech scanned 1,400 vibe-coded production applications and found 65% had security issues, with 58% containing at least one critical vulnerability.
Can you use AI coding tools safely in enterprise environments?
Yes, with proportional governance. Low-risk AI code (CRUD operations, UI components, data transformations) can use standard review processes with automated SAST. High-risk AI code (authentication, authorization, payment logic, secrets management) requires mandatory human security review, automated header and CSRF checks in CI, and threat modeling before changes merge. The teams getting this right in 2026 are not banning AI tools. They are attributing AI code at the PR level, measuring it separately, and enforcing risk-proportional review gates automatically.



