Code Generation Validation
Verify syntax, logic, and security when using LLMs to generate code.
The Stakes of Code Generation
LLMs can write code quickly, but that code can contain bugs, security vulnerabilities, or logic errors that break production systems. Unlike text generation, code must be functionally correct, not just plausible-sounding.
What to Evaluate
1. Syntax Correctness
Does the code parse without errors?
2. Functional Correctness
Does the code produce the expected output?
Task: "Write a function to reverse a string"
generatedCode("hello") → "olleh" ✓
generatedCode("") → "" ✓
generatedCode("a") → "a" ✓
3. Edge Case Handling
Does it handle null, empty inputs, large values, etc.?
- Null/undefined inputs
- Empty arrays or strings
- Very large numbers
- Negative numbers where unexpected
- Special characters and unicode
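The edge cases above can be probed with a small table-driven harness. This is a minimal sketch: `reverse_string` stands in for whatever generated function is under test, and the case table is illustrative.

```python
def reverse_string(s):  # stand-in for the generated code under test
    return s[::-1]

EDGE_CASES = [
    ("", ""),            # empty string
    ("a", "a"),          # single character
    ("héllo", "olléh"),  # unicode
    ("a b", "b a"),      # whitespace preserved
]

def run_edge_cases(fn, cases):
    """Return a list of (input, expected, got, passed) tuples."""
    results = []
    for inp, expected in cases:
        try:
            got = fn(inp)
        except Exception as exc:  # a crash is itself a finding, not a test error
            got = exc
        results.append((inp, expected, got, got == expected))
    return results

report = run_edge_cases(reverse_string, EDGE_CASES)
print(all(passed for *_, passed in report))  # True if every case passes
```

Recording failures instead of stopping at the first one gives the LLM richer feedback if you regenerate.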
4. Security
Is the code vulnerable to common attacks?
- SQL injection: Use parameterized queries
- XSS: Sanitize user inputs
- Path traversal: Validate file paths
- eval() usage: Avoid dynamic code execution
- Secrets in code: No hardcoded API keys
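The SQL injection item is the easiest to demonstrate concretely. The sketch below uses Python's built-in `sqlite3` module; the table and payload are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable pattern: string interpolation lets the payload rewrite the query
vulnerable = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe pattern: the driver binds the value as data, never as SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(vulnerable), len(safe))  # 1 0 — the injection returns a row; the bound query returns none
```

Generated code that builds queries with f-strings or string concatenation should fail review even if it happens to work on the test inputs.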
5. Code Quality
- Readability: Clear variable names, comments
- Efficiency: Appropriate time/space complexity
- Best practices: Follows language conventions
- Maintainability: Modular, not overly complex
Evaluation Framework
Step 1: Syntax Validation
Run language-specific parsers/compilers:
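For Python, the standard library's `ast` module parses a snippet without executing it; other languages have equivalents (e.g. `node --check` for JavaScript, `gcc -fsyntax-only` for C). A minimal sketch:

```python
import ast

def check_python_syntax(code: str):
    """Return (ok, error_message) for a candidate Python snippet."""
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"

print(check_python_syntax("def f(x):\n    return x * 2"))  # (True, None)
print(check_python_syntax("def f(x)\n    return x * 2"))   # fails at the missing colon
```

Parsing never runs the code, so this check is safe to apply before any sandboxing.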
Step 2: Unit Test Execution
Create comprehensive test suites for generated code:
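A minimal harness can exec the generated code into an isolated namespace and run a case table against it. The convention that the function is named `solve` is an assumption of this sketch; in practice the name comes from your prompt's specification.

```python
def run_generated(code: str, test_cases):
    """Exec generated code in a fresh namespace, then check (args, expected)
    cases against a function the spec requires to be named `solve`."""
    ns = {}
    exec(code, ns)  # NOTE: only safe inside a sandboxed environment
    fn = ns["solve"]
    failures = []
    for args, expected in test_cases:
        got = fn(*args)
        if got != expected:
            failures.append((args, expected, got))
    return failures

code = "def solve(s):\n    return s[::-1]"
cases = [(("hello",), "olleh"), (("",), ""), (("a",), "a")]
print(run_generated(code, cases))  # [] means all tests passed
```

Returning the full failure list (not just pass/fail) gives you concrete feedback to include when regenerating.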
Step 3: Security Scanning
Use static analysis tools to detect vulnerabilities:
- Bandit (Python security linter)
- ESLint with security plugins (JavaScript)
- Semgrep (multi-language security scanner)
- CodeQL (GitHub's semantic code analysis)
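The tools above are the right choice in practice. As a toy illustration of what AST-based static scanning looks for, here is a deliberately simplified check that flags dynamic code execution; it is not a substitute for Bandit or Semgrep:

```python
import ast

def toy_security_scan(code: str):
    """Toy static scan: flag calls to eval()/exec().
    Real projects should use Bandit, Semgrep, or CodeQL instead."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append(f"line {node.lineno}: dynamic execution via {node.func.id}()")
    return findings

print(toy_security_scan("x = eval(user_input)"))  # one finding
print(toy_security_scan("x = int(user_input)"))   # []
```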
Step 4: Code Quality Analysis
Measure readability, complexity, and convention adherence with linters and metrics tools (e.g., pylint and radon for Python):
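A crude complexity proxy can be computed from the AST alone; this sketch counts branching constructs as a stand-in for real metrics like radon's cyclomatic complexity:

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def branch_count(code: str) -> int:
    """Crude complexity proxy: count branching constructs.
    A stand-in for proper metrics such as cyclomatic complexity."""
    return sum(isinstance(n, BRANCH_NODES) for n in ast.walk(ast.parse(code)))

simple = "def f(x):\n    return x + 1"
branchy = "def f(x):\n    if x > 0:\n        for i in range(x):\n            x += i\n    return x"
print(branch_count(simple), branch_count(branchy))  # 0 2
```

Flagging generated functions above a complexity threshold is a cheap way to catch the "overly clever" solutions LLMs sometimes produce.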
Step 5: LLM-as-Judge for Logic
Use another LLM to review code logic and design:
Judge Prompt:
"Review this code for logical correctness, efficiency, and edge case handling. Score 1-5 on: (1) Correctness: Does the logic match the requirements? (2) Efficiency: Is the time complexity reasonable? (3) Robustness: Does it handle edge cases? Provide specific issues if score < 4."
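A judge like this is easiest to act on if you ask for a structured verdict. The sketch below stubs out the model call with a canned JSON reply just to stay runnable; `call_llm` is a placeholder for your actual client:

```python
import json

JUDGE_TEMPLATE = """Review this code for logical correctness, efficiency,
and edge case handling. Score 1-5 on correctness, efficiency, robustness.
Respond with JSON: {{"correctness": n, "efficiency": n, "robustness": n, "issues": [...]}}

Code:
{code}"""

def call_llm(prompt: str) -> str:
    # Placeholder: substitute your model client here. The canned reply
    # below exists only to make the sketch runnable.
    return '{"correctness": 5, "efficiency": 4, "robustness": 3, "issues": ["no null check"]}'

def judge_code(code: str, threshold: int = 4):
    """Return the judge's verdict plus any scores below the threshold."""
    verdict = json.loads(call_llm(JUDGE_TEMPLATE.format(code=code)))
    flagged = {k: v for k, v in verdict.items()
               if isinstance(v, int) and v < threshold}
    return verdict, flagged

verdict, flagged = judge_code("def last(arr): return arr[-1]")
print(flagged)  # {'robustness': 3}
```

Parsing a JSON verdict lets you gate deployment mechanically (e.g., block anything with a flagged dimension) instead of eyeballing free-text reviews.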
Common Failure Modes
1. Off-by-One Errors
Task: "Get last element of array"
❌ arr[arr.length]
✓ arr[arr.length - 1]
2. Type Coercion Bugs
Task: "Add a user-supplied value to a total"
❌ "5" + 1 → "51" (string concatenation in JavaScript)
✓ Number("5") + 1 → 6
3. Missing Null Checks
Task: "Get a user's email domain"
❌ user.email.split("@")[1] — throws if email is null
✓ Guard with a null check before accessing properties
4. Inefficient Algorithms
Task: "Check if array contains duplicate"
❌ O(n²) nested loops
✓ O(n) using Set
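Both versions pass a small functional test suite, which is exactly why efficiency needs its own check. A side-by-side sketch:

```python
def has_duplicate_quadratic(arr):
    """O(n^2): the nested-loop pattern generated code often falls into."""
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j]:
                return True
    return False

def has_duplicate_linear(arr):
    """O(n): a set records elements already seen."""
    seen = set()
    for x in arr:
        if x in seen:
            return True
        seen.add(x)
    return False

print(has_duplicate_linear([1, 2, 3, 2]), has_duplicate_linear([1, 2, 3]))  # True False
```

Catching this class of failure requires performance tests with large inputs, not just correctness tests.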
Test Suite Design
Coverage Matrix
Ensure your test suite covers all scenarios:
| Category | Example Tests |
|---|---|
| Happy path | Normal inputs, expected output |
| Edge cases | Empty, null, single element |
| Boundaries | Max int, min int, array limits |
| Error cases | Invalid types, out of bounds |
| Performance | Large inputs, time limits |
Example: Comprehensive Test Suite
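A suite covering the matrix above might look like this, using the earlier string-reversal task; `reverse_string` again stands in for the generated function under test:

```python
import unittest

def reverse_string(s):  # stand-in for the generated function under test
    return s[::-1]

class TestReverseString(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(reverse_string("hello"), "olleh")

    def test_edge_cases(self):
        self.assertEqual(reverse_string(""), "")
        self.assertEqual(reverse_string("a"), "a")

    def test_boundaries(self):
        big = "x" * 10**6  # a large input should still round-trip
        self.assertEqual(reverse_string(reverse_string(big)), big)

    def test_error_cases(self):
        with self.assertRaises(TypeError):
            reverse_string(None)  # spec says strings only

suite = unittest.TestLoader().loadTestsFromTestCase(TestReverseString)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Each test method maps to one row of the coverage matrix, which makes gaps easy to spot in review.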
Best Practices
1. Sandbox Execution
Always run generated code in an isolated environment:
- Use Docker containers for isolation
- Set CPU and memory limits
- Implement execution timeouts
- Restrict file system and network access
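The timeout layer alone can be sketched with the standard library; real isolation additionally needs containers, resource limits, and network restrictions as listed above:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: float = 2.0):
    """Run untrusted code in a subprocess with a wall-clock timeout.
    This covers only the timeout layer; containers and rlimits are
    still required for real isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out"
    finally:
        os.unlink(path)

print(run_sandboxed("print(6 * 7)")[1].strip())            # 42
print(run_sandboxed("while True: pass", timeout_s=0.5)[2])  # timed out
```

Running in a separate process also means a crash in generated code cannot take down the evaluation harness itself.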
2. Provide Clear Specifications
Give the LLM detailed requirements:
Vague: "Write a sort function"
Clear: "Write a function sort(arr) that sorts an array of integers in ascending order. Should handle empty arrays and duplicates. Use O(n log n) algorithm."
3. Iterative Refinement
If tests fail, provide feedback and regenerate:
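The regenerate loop can be sketched as follows. `generate_code` is a placeholder for a real model call (the canned reply only keeps the sketch runnable), and the `solve` function name is an assumed convention from the spec:

```python
def generate_code(prompt: str) -> str:
    # Placeholder for a real model call; the canned reply keeps this runnable.
    return "def solve(s):\n    return s[::-1]"

def run_tests(code: str, cases):
    ns = {}
    exec(code, ns)  # sandbox this in practice
    failures = []
    for args, expected in cases:
        got = ns["solve"](*args)
        if got != expected:
            failures.append((args, expected, got))
    return failures

def refine_loop(task: str, cases, max_rounds: int = 3):
    prompt = task
    for _ in range(max_rounds):
        code = generate_code(prompt)
        failures = run_tests(code, cases)
        if not failures:
            return code
        # Feed the concrete failing cases back so the next attempt can fix them
        prompt = f"{task}\nThese cases failed: {failures}\nFix the code."
    return None  # give up after max_rounds and escalate to a human

cases = [(("hello",), "olleh"), (("",), "")]
print(refine_loop("Write solve(s) that reverses a string", cases) is not None)  # True
```

Bounding the loop matters: if the model cannot pass after a few rounds, escalating to a human is cheaper than burning tokens.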
4. Version Control Generated Code
- Track what prompts generated what code
- Monitor quality trends over time
- Identify which types of tasks fail most
Production Deployment
Gradual Rollout
- Shadow mode: Generate code but don't execute (compare to human-written)
- Limited pilot: Use for low-risk tasks only
- Human review: Require approval before deployment
- Monitoring: Track errors and rollback if needed
Continuous Validation
Re-run tests periodically on production code:
- Regression tests after model updates
- Monitor runtime errors and crashes
- A/B test generated vs. hand-written code
Real-World Example
SQL Query Generation for Analytics
Use Case: Convert natural language to SQL
Evaluation:
- Syntax validation with SQL parser
- Execute against test database
- Compare results to expected output
- Security scan for injection vulnerabilities
- LLM judge reviews query logic
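The syntax-validation step can be sketched against a throwaway in-memory database using Python's built-in `sqlite3`; the schema here is illustrative, and SQLite's dialect differs from production databases, so treat this as a first-pass filter only:

```python
import sqlite3

def validate_sql(query: str, schema_sql: str):
    """Syntax-check a generated query against a throwaway in-memory schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    try:
        conn.execute(f"EXPLAIN {query}")  # plans the query without running it
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

schema = "CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT);"
print(validate_sql("SELECT SUM(amount) FROM orders GROUP BY created_at", schema))  # (True, None)
print(validate_sql("SELEC amount FROM orders", schema)[0])  # False
```

Because `EXPLAIN` only plans the query, this also catches references to nonexistent tables and columns without touching real data.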
Results:
- 95% syntax correctness
- 87% functional correctness
- 100% passed security scans
- Reduced query writing time by 60%