Code Generation Validation
Verify syntax, logic, and security when using LLMs to generate code.
The Stakes of Code Generation
LLMs can write code quickly, but that code can contain bugs, security vulnerabilities, or logic errors that break production systems. Unlike text generation, code must be functionally correct, not just plausible-sounding.
What to Evaluate
1. Syntax Correctness
Does the code parse without errors?
2. Functional Correctness
Does the code produce the expected output?
Task: "Write a function to reverse a string"
generatedCode("hello") → "olleh" ✓
generatedCode("") → "" ✓
generatedCode("a") → "a" ✓
3. Edge Case Handling
Does it handle null, empty inputs, large values, etc.?
- Null/undefined inputs
- Empty arrays or strings
- Very large numbers
- Negative numbers where unexpected
- Special characters and unicode
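The edge cases above can be probed with a small table-driven harness. This is a minimal sketch: `reverse_string` stands in for whatever generated function is under test, and the case table is illustrative.

```python
def reverse_string(s):  # stand-in for the generated code under test
    return s[::-1]

EDGE_CASES = [
    ("", ""),            # empty string
    ("a", "a"),          # single character
    ("héllo", "olléh"),  # unicode
    ("a b", "b a"),      # whitespace preserved
]

def run_edge_cases(fn, cases):
    """Return a list of (input, expected, got, passed) tuples."""
    results = []
    for inp, expected in cases:
        try:
            got = fn(inp)
        except Exception as exc:  # a crash is itself a finding, not a test error
            got = exc
        results.append((inp, expected, got, got == expected))
    return results

report = run_edge_cases(reverse_string, EDGE_CASES)
print(all(passed for *_, passed in report))  # True if every case passes
```

Recording failures instead of stopping at the first one gives the LLM richer feedback if you regenerate.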
4. Security
Is the code vulnerable to common attacks?
- SQL injection: Use parameterized queries
- XSS: Sanitize user inputs
- Path traversal: Validate file paths
- eval() usage: Avoid dynamic code execution
- Secrets in code: No hardcoded API keys
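The SQL injection item is the easiest to demonstrate concretely. The sketch below uses Python's built-in `sqlite3` module; the table and payload are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable pattern: string interpolation lets the payload rewrite the query
vulnerable = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe pattern: the driver binds the value as data, never as SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(vulnerable), len(safe))  # 1 0 — the injection returns a row; the bound query returns none
```

Generated code that builds queries with f-strings or string concatenation should fail review even if it happens to work on the test inputs.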
5. Code Quality
- Readability: Clear variable names, comments
- Efficiency: Appropriate time/space complexity
- Best practices: Follows language conventions
- Maintainability: Modular, not overly complex
Evaluation Framework
Step 1: Syntax Validation
Run language-specific parsers/compilers:
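For Python, the standard library's `ast` module parses a snippet without executing it; other languages have equivalents (e.g. `node --check` for JavaScript, `gcc -fsyntax-only` for C). A minimal sketch:

```python
import ast

def check_python_syntax(code: str):
    """Return (ok, error_message) for a candidate Python snippet."""
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"

print(check_python_syntax("def f(x):\n    return x * 2"))  # (True, None)
print(check_python_syntax("def f(x)\n    return x * 2"))   # fails at the missing colon
```

Parsing never runs the code, so this check is safe to apply before any sandboxing.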
Step 2: Unit Test Execution
Create comprehensive test suites for generated code:
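A minimal harness can exec the generated code into an isolated namespace and run a case table against it. The convention that the function is named `solve` is an assumption of this sketch; in practice the name comes from your prompt's specification.

```python
def run_generated(code: str, test_cases):
    """Exec generated code in a fresh namespace, then check (args, expected)
    cases against a function the spec requires to be named `solve`."""
    ns = {}
    exec(code, ns)  # NOTE: only safe inside a sandboxed environment
    fn = ns["solve"]
    failures = []
    for args, expected in test_cases:
        got = fn(*args)
        if got != expected:
            failures.append((args, expected, got))
    return failures

code = "def solve(s):\n    return s[::-1]"
cases = [(("hello",), "olleh"), (("",), ""), (("a",), "a")]
print(run_generated(code, cases))  # [] means all tests passed
```

Returning the full failure list (not just pass/fail) gives you concrete feedback to include when regenerating.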
Step 3: Security Scanning
Use static analysis tools to detect vulnerabilities:
- Bandit (Python security linter)
- ESLint with security plugins (JavaScript)
- Semgrep (multi-language security scanner)
- CodeQL (GitHub's semantic code analysis)
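The tools above are the right choice in practice. As a toy illustration of what AST-based static scanning looks for, here is a deliberately simplified check that flags dynamic code execution; it is not a substitute for Bandit or Semgrep:

```python
import ast

def toy_security_scan(code: str):
    """Toy static scan: flag calls to eval()/exec().
    Real projects should use Bandit, Semgrep, or CodeQL instead."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append(f"line {node.lineno}: dynamic execution via {node.func.id}()")
    return findings

print(toy_security_scan("x = eval(user_input)"))  # one finding
print(toy_security_scan("x = int(user_input)"))   # []
```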
Step 4: Code Quality Analysis
Measure readability, complexity, and convention adherence with linters and metrics tools (e.g., pylint and radon for Python):
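A crude complexity proxy can be computed from the AST alone; this sketch counts branching constructs as a stand-in for real metrics like radon's cyclomatic complexity:

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def branch_count(code: str) -> int:
    """Crude complexity proxy: count branching constructs.
    A stand-in for proper metrics such as cyclomatic complexity."""
    return sum(isinstance(n, BRANCH_NODES) for n in ast.walk(ast.parse(code)))

simple = "def f(x):\n    return x + 1"
branchy = "def f(x):\n    if x > 0:\n        for i in range(x):\n            x += i\n    return x"
print(branch_count(simple), branch_count(branchy))  # 0 2
```

Flagging generated functions above a complexity threshold is a cheap way to catch the "overly clever" solutions LLMs sometimes produce.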
Step 5: LLM-as-Judge for Logic
Use another LLM to review code logic and design:
Judge Prompt:
"Review this code for logical correctness, efficiency, and edge case handling. Score 1-5 on: (1) Correctness: Does the logic match the requirements? (2) Efficiency: Is the time complexity reasonable? (3) Robustness: Does it handle edge cases? Provide specific issues if score < 4."
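A judge like this is easiest to act on if you ask for a structured verdict. The sketch below stubs out the model call with a canned JSON reply just to stay runnable; `call_llm` is a placeholder for your actual client:

```python
import json

JUDGE_TEMPLATE = """Review this code for logical correctness, efficiency,
and edge case handling. Score 1-5 on correctness, efficiency, robustness.
Respond with JSON: {{"correctness": n, "efficiency": n, "robustness": n, "issues": [...]}}

Code:
{code}"""

def call_llm(prompt: str) -> str:
    # Placeholder: substitute your model client here. The canned reply
    # below exists only to make the sketch runnable.
    return '{"correctness": 5, "efficiency": 4, "robustness": 3, "issues": ["no null check"]}'

def judge_code(code: str, threshold: int = 4):
    """Return the judge's verdict plus any scores below the threshold."""
    verdict = json.loads(call_llm(JUDGE_TEMPLATE.format(code=code)))
    flagged = {k: v for k, v in verdict.items()
               if isinstance(v, int) and v < threshold}
    return verdict, flagged

verdict, flagged = judge_code("def last(arr): return arr[-1]")
print(flagged)  # {'robustness': 3}
```

Parsing a JSON verdict lets you gate deployment mechanically (e.g., block anything with a flagged dimension) instead of eyeballing free-text reviews.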
Common Failure Modes
1. Off-by-One Errors
Task: "Get last element of array"
❌ arr[arr.length]
✓ arr[arr.length - 1]
2. Type Coercion Bugs
Task: "Add a user-supplied value to a total"
❌ "5" + 1 → "51" (string concatenation in JavaScript)
✓ Number("5") + 1 → 6
3. Missing Null Checks
Task: "Get a user's email domain"
❌ user.email.split("@")[1] — throws if email is null
✓ Guard with a null check before accessing properties
4. Inefficient Algorithms
Task: "Check if array contains duplicate"
❌ O(n²) nested loops
✓ O(n) using Set
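Both versions pass a small functional test suite, which is exactly why efficiency needs its own check. A side-by-side sketch:

```python
def has_duplicate_quadratic(arr):
    """O(n^2): the nested-loop pattern generated code often falls into."""
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j]:
                return True
    return False

def has_duplicate_linear(arr):
    """O(n): a set records elements already seen."""
    seen = set()
    for x in arr:
        if x in seen:
            return True
        seen.add(x)
    return False

print(has_duplicate_linear([1, 2, 3, 2]), has_duplicate_linear([1, 2, 3]))  # True False
```

Catching this class of failure requires performance tests with large inputs, not just correctness tests.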
Test Suite Design
Coverage Matrix
Ensure your test suite covers all scenarios:
| Category | Example Tests |
|---|---|
| Happy path | Normal inputs, expected output |
| Edge cases | Empty, null, single element |
| Boundaries | Max int, min int, array limits |
| Error cases | Invalid types, out of bounds |
| Performance | Large inputs, time limits |
Example: Comprehensive Test Suite
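A suite covering the matrix above might look like this, using the earlier string-reversal task; `reverse_string` again stands in for the generated function under test:

```python
import unittest

def reverse_string(s):  # stand-in for the generated function under test
    return s[::-1]

class TestReverseString(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(reverse_string("hello"), "olleh")

    def test_edge_cases(self):
        self.assertEqual(reverse_string(""), "")
        self.assertEqual(reverse_string("a"), "a")

    def test_boundaries(self):
        big = "x" * 10**6  # a large input should still round-trip
        self.assertEqual(reverse_string(reverse_string(big)), big)

    def test_error_cases(self):
        with self.assertRaises(TypeError):
            reverse_string(None)  # spec says strings only

suite = unittest.TestLoader().loadTestsFromTestCase(TestReverseString)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Each test method maps to one row of the coverage matrix, which makes gaps easy to spot in review.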
Best Practices
1. Sandbox Execution
Always run generated code in an isolated environment:
- Use Docker containers for isolation
- Set CPU and memory limits
- Implement execution timeouts
- Restrict file system and network access
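The timeout layer alone can be sketched with the standard library; real isolation additionally needs containers, resource limits, and network restrictions as listed above:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: float = 2.0):
    """Run untrusted code in a subprocess with a wall-clock timeout.
    This covers only the timeout layer; containers and rlimits are
    still required for real isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out"
    finally:
        os.unlink(path)

print(run_sandboxed("print(6 * 7)")[1].strip())            # 42
print(run_sandboxed("while True: pass", timeout_s=0.5)[2])  # timed out
```

Running in a separate process also means a crash in generated code cannot take down the evaluation harness itself.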
2. Provide Clear Specifications
Give the LLM detailed requirements:
Vague: "Write a sort function"
Clear: "Write a function sort(arr) that sorts an array of integers in ascending order. Should handle empty arrays and duplicates. Use O(n log n) algorithm."
3. Iterative Refinement
If tests fail, provide feedback and regenerate:
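The regenerate loop can be sketched as follows. `generate_code` is a placeholder for a real model call (the canned reply only keeps the sketch runnable), and the `solve` function name is an assumed convention from the spec:

```python
def generate_code(prompt: str) -> str:
    # Placeholder for a real model call; the canned reply keeps this runnable.
    return "def solve(s):\n    return s[::-1]"

def run_tests(code: str, cases):
    ns = {}
    exec(code, ns)  # sandbox this in practice
    failures = []
    for args, expected in cases:
        got = ns["solve"](*args)
        if got != expected:
            failures.append((args, expected, got))
    return failures

def refine_loop(task: str, cases, max_rounds: int = 3):
    prompt = task
    for _ in range(max_rounds):
        code = generate_code(prompt)
        failures = run_tests(code, cases)
        if not failures:
            return code
        # Feed the concrete failing cases back so the next attempt can fix them
        prompt = f"{task}\nThese cases failed: {failures}\nFix the code."
    return None  # give up after max_rounds and escalate to a human

cases = [(("hello",), "olleh"), (("",), "")]
print(refine_loop("Write solve(s) that reverses a string", cases) is not None)  # True
```

Bounding the loop matters: if the model cannot pass after a few rounds, escalating to a human is cheaper than burning tokens.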
4. Version Control Generated Code
- Track what prompts generated what code
- Monitor quality trends over time
- Identify which types of tasks fail most
Production Deployment
Gradual Rollout
- Shadow mode: Generate code but don't execute (compare to human-written)
- Limited pilot: Use for low-risk tasks only
- Human review: Require approval before deployment
- Monitoring: Track errors and rollback if needed
Continuous Validation
Re-run tests periodically on production code:
- Regression tests after model updates
- Monitor runtime errors and crashes
- A/B test generated vs. hand-written code
Real-World Example
SQL Query Generation for Analytics
Use Case: Convert natural language to SQL
Evaluation:
- Syntax validation with SQL parser
- Execute against test database
- Compare results to expected output
- Security scan for injection vulnerabilities
- LLM judge reviews query logic
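The syntax-validation step can be sketched against a throwaway in-memory database using Python's built-in `sqlite3`; the schema here is illustrative, and SQLite's dialect differs from production databases, so treat this as a first-pass filter only:

```python
import sqlite3

def validate_sql(query: str, schema_sql: str):
    """Syntax-check a generated query against a throwaway in-memory schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    try:
        conn.execute(f"EXPLAIN {query}")  # plans the query without running it
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

schema = "CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT);"
print(validate_sql("SELECT SUM(amount) FROM orders GROUP BY created_at", schema))  # (True, None)
print(validate_sql("SELEC amount FROM orders", schema)[0])  # False
```

Because `EXPLAIN` only plans the query, this also catches references to nonexistent tables and columns without touching real data.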
Results:
- 95% syntax correctness
- 87% functional correctness
- 100% passed security scans
- Reduced query writing time by 60%