EvalGate

Background Job Runner

DB-backed job queue built on PostgreSQL for reliable async processing

Jobs are stored in the jobs table and processed by runDueJobs(), which Vercel Cron invokes every minute at POST /api/jobs/run.

System Invariants

These are the rules the job system must always obey:

1. Idempotency is enforced

At both enqueue-time and handler-time:

• Enqueue uses INSERT … ON CONFLICT DO NOTHING on idempotency key
• Webhook handler checks existing successful deliveries before sending

2. Jobs never get "stuck"

• Worker crashes mid-run → job reclaimable after locked_until expires (2 min TTL)
• Reclaimed jobs tagged with JOB_LOCK_TIMEOUT_RECLAIMED for traceability

3. Retries are deterministic

• Backoff math is stable (exponential with ±10% jitter)
• next_run_at stored and queryable for every pending retry
• attempt >= maxAttempts → dead_letter

4. Observability is not optional

• Every attempt records timing, duration, error codes
• Structured logs on enqueue, claim, success, failure, reclaim
• DLQ searchable by org + type + error code + time window

5. Payloads are validated

• Max 128 KB serialized size, 10 levels depth, 500 keys
• Zod schema validation per job type (can be skipped with skipValidation: true)

Acceptance criteria: you can answer "What failed? Why? How often? Is it stuck? Who owns it?" in under 30 seconds.

Status Lifecycle

enqueue()  (atomic idempotency via ON CONFLICT DO NOTHING)
    │
    ▼
 pending  ──────────────────────────────────────────────────────────────────┐
    │                                                                        │
    │  optimistic claim (UPDATE WHERE status='pending'                       │
    │    OR (status='running' AND locked_until <= now))                      │
    ▼                                                                        │
 running  ── handler throws, attempt < maxAttempts ──► pending (backoff)    │
    │              │                                                         │
    │              ├── 429 → RATE_LIMITED (Retry-After)                      │
    │              ├── 5xx → UPSTREAM_5XX                                    │
    │              └── other → HANDLER_ERROR                                 │
    │                                                                        │
    │  handler throws, attempt >= maxAttempts                                │
    ├──────────────────────────────────────────────────────────────────────► dead_letter
    │                                                                        │
    │  payload invalid (Zod) at runner time                                  │
    ├──────────────────────────────────────────────────────────────────────► dead_letter
    │                                                                        │
    │  no handler registered                                                 │
    ├──────────────────────────────────────────────────────────────────────► dead_letter
    │                                                                        │
    │  handler succeeds                                                      │
    ▼                                                                        │
 success                                                                     │
                                                                             │
 POST /api/jobs/:id/retry ◄──────────────────────────────────────────────────┘
    │  mode: now | later | reset

Terminal States

success, dead_letter

Retriable States

pending, running (via TTL reclaim)
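The claim rule and the failure transition from the diagram above can be sketched as two pure functions. This is a minimal model, not the runner's actual code: the JobRow shape and its field names are hypothetical, and the check that a pending job is due (nextRunAt <= now) is an assumption the diagram does not spell out.

```typescript
type JobStatus = "pending" | "running" | "success" | "dead_letter";

// Hypothetical in-memory view of a row in the jobs table.
interface JobRow {
  status: JobStatus;
  attempt: number;
  maxAttempts: number;
  nextRunAt: Date;          // assumed: when a pending job becomes due
  lockedUntil: Date | null; // lease expiry set when a worker claims the job
}

// Mirrors the claim predicate: pending jobs, plus running jobs whose lock
// has expired (TTL reclaim after a worker crash).
function isClaimable(job: JobRow, now: Date): boolean {
  if (job.status === "pending") {
    // Assumption: only jobs that are already due get claimed.
    return job.nextRunAt.getTime() <= now.getTime();
  }
  if (job.status === "running") {
    return job.lockedUntil !== null && job.lockedUntil.getTime() <= now.getTime();
  }
  return false; // success and dead_letter are terminal
}

// Status after a handler throws: back to pending until attempts run out.
function statusAfterFailure(job: JobRow): JobStatus {
  return job.attempt >= job.maxAttempts ? "dead_letter" : "pending";
}
```

In the real system the same predicate lives in the WHERE clause of the claiming UPDATE, so that only one worker wins the row.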

Enqueue Safety

Atomic Idempotency

INSERT INTO jobs (...) VALUES (...)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING id

If the INSERT returns 0 rows, the idempotency key already exists, so enqueue() SELECTs the existing job ID instead. Two concurrent enqueue() calls with the same idempotencyKey therefore return the same job ID and create no duplicates.
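To make the two-step flow concrete, here is a small in-memory model of it. It is only a sketch: InMemoryJobsTable, insertIfAbsent, and findIdByKey are hypothetical stand-ins for the jobs table and the two SQL statements, and the real enqueue() talks to PostgreSQL rather than a plain object.

```typescript
interface Job {
  id: string;
  type: string;
  idempotencyKey: string;
}

// Stand-in for the jobs table with a unique index on idempotency_key.
class InMemoryJobsTable {
  private byKey: Record<string, Job> = Object.create(null);
  private nextId = 1;

  // Models INSERT ... ON CONFLICT (idempotency_key) DO NOTHING RETURNING id:
  // returns the new id, or null when the key already exists (0 rows).
  insertIfAbsent(type: string, idempotencyKey: string): string | null {
    if (idempotencyKey in this.byKey) return null;
    const job: Job = { id: `job_${this.nextId++}`, type, idempotencyKey };
    this.byKey[idempotencyKey] = job;
    return job.id;
  }

  // Models SELECT id FROM jobs WHERE idempotency_key = $1.
  findIdByKey(idempotencyKey: string): string | null {
    const job = this.byKey[idempotencyKey];
    return job ? job.id : null;
  }

  count(): number {
    return Object.keys(this.byKey).length;
  }
}

// Insert first; on conflict, fall back to reading the existing row.
function enqueue(
  table: InMemoryJobsTable,
  type: string,
  idempotencyKey: string
): { id: string; created: boolean } {
  const inserted = table.insertIfAbsent(type, idempotencyKey);
  if (inserted !== null) return { id: inserted, created: true };
  const existing = table.findIdByKey(idempotencyKey);
  if (existing === null) throw new Error("conflicting job not found");
  return { id: existing, created: false };
}
```

Calling enqueue() twice with the same key yields the same id, with created set to false on the second call. The database's unique constraint is what makes this safe under real concurrency; the model only shows the control flow.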

Payload Validation

Before insertion, enqueue() validates:

• Size: serialized JSON ≤ 128 KB
• Depth: max 10 levels of nesting
• Keys: max 500 total keys
• Schema: Zod validation per job type (skippable via skipValidation: true)

Violations throw EnqueueError with code PAYLOAD_TOO_LARGE or PAYLOAD_INVALID.
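The structural limits above (size, depth, key count) can be checked without any schema library. The following is a rough sketch under stated assumptions: the EnqueueError class shape, the helper names, the way depth is counted (a flat object is depth 1), and the choice of PAYLOAD_INVALID for depth and key violations are all guesses, since the source only names the two error codes.

```typescript
const MAX_BYTES = 128 * 1024;
const MAX_DEPTH = 10;
const MAX_KEYS = 500;

type PayloadErrorCode = "PAYLOAD_TOO_LARGE" | "PAYLOAD_INVALID";

// Hypothetical shape; the source only says EnqueueError carries a code.
class EnqueueError extends Error {
  code: PayloadErrorCode;
  constructor(code: PayloadErrorCode, message: string) {
    super(message);
    this.code = code;
  }
}

// UTF-8 length of a string, counting surrogate pairs as 4 bytes.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) bytes += 1;
    else if (c < 0x800) bytes += 2;
    else if (c >= 0xd800 && c <= 0xdbff) { bytes += 4; i++; }
    else bytes += 3;
  }
  return bytes;
}

// Walks the value once; returns max nesting depth and total object keys.
function measure(value: unknown): { depth: number; keys: number } {
  if (value === null || typeof value !== "object") return { depth: 0, keys: 0 };
  const obj = value as Record<string, unknown>;
  const names = Object.keys(obj);
  let keys = Array.isArray(value) ? 0 : names.length; // array indices are not keys
  let maxChild = 0;
  for (let i = 0; i < names.length; i++) {
    const child = measure(obj[names[i]]);
    if (child.depth > maxChild) maxChild = child.depth;
    keys += child.keys;
  }
  return { depth: maxChild + 1, keys };
}

function assertPayloadWithinLimits(payload: unknown): void {
  const json = JSON.stringify(payload);
  if (json === undefined) {
    throw new EnqueueError("PAYLOAD_INVALID", "payload is not JSON-serializable");
  }
  if (utf8ByteLength(json) > MAX_BYTES) {
    throw new EnqueueError("PAYLOAD_TOO_LARGE", "serialized payload exceeds 128 KB");
  }
  const m = measure(payload);
  // Assumption: depth and key violations map to PAYLOAD_INVALID.
  if (m.depth > MAX_DEPTH) throw new EnqueueError("PAYLOAD_INVALID", "nesting deeper than 10 levels");
  if (m.keys > MAX_KEYS) throw new EnqueueError("PAYLOAD_INVALID", "more than 500 keys");
}
```

Zod schema validation for the specific job type would run after these checks, unless skipValidation is set.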

Common Job Types

Webhook Delivery

• HTTP POST to external endpoints
• Retry on 429/5xx with exponential backoff
• Signature verification for security
• Delivery tracking and receipts

Report Generation

• PDF/CSV export creation
• Large dataset processing
• Template rendering
• Email delivery

Data Cleanup

• Retention policy enforcement
• Cache invalidation
• Archive old records
• GDPR compliance tasks

Notifications

• Email notifications
• Slack/Discord integrations
• SMS alerts
• Push notifications

Error Handling & Retries

Retry Strategy

// Exponential backoff with ±10% jitter
delay = baseDelay * (2 ** attempt) * (0.9 + Math.random() * 0.2)

// Max attempts by job type
webhook: 5 attempts (1s, 2s, 4s, 8s, 16s)
report: 3 attempts (30s, 60s, 120s)
cleanup: 2 attempts (5m, 15m)
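A runnable version of the backoff formula might look like the sketch below. The function names, the zero-based attempt index, the millisecond units, and the rounding are assumptions made for illustration. The random source is injectable so the jitter can be pinned in tests.

```typescript
// Delay before the next retry, in milliseconds.
// attempt is assumed to be zero-based: attempt 0 waits about baseDelayMs.
function backoffDelayMs(
  baseDelayMs: number,
  attempt: number,
  rand: () => number = Math.random
): number {
  const jitter = 0.9 + rand() * 0.2; // uniform in [0.9, 1.1), i.e. ±10%
  return Math.round(baseDelayMs * 2 ** attempt * jitter);
}

// Timestamp to store in next_run_at for a pending retry.
function nextRunAt(
  now: Date,
  baseDelayMs: number,
  attempt: number,
  rand: () => number = Math.random
): Date {
  return new Date(now.getTime() + backoffDelayMs(baseDelayMs, attempt, rand));
}
```

With a 1-second base and the jitter pinned to its midpoint (rand returning 0.5), attempts 0 through 4 produce 1000, 2000, 4000, 8000, and 16000 ms, which matches the webhook schedule listed above.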

Error Classification

Retryable Errors

• RATE_LIMITED - HTTP 429
• UPSTREAM_5XX - Server errors
• TIMEOUT - Network timeouts
• NETWORK_ERROR - Connection failures

Non-Retryable Errors

• HANDLER_ERROR - Logic errors
• PAYLOAD_INVALID - Validation failures
• AUTHENTICATION_FAILED - Auth errors
• NOT_FOUND - Missing resources
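The classification above can be expressed as a small lookup. This sketch follows the lists in this section and the HTTP status mapping from the lifecycle diagram (429, 5xx, other). The function names are hypothetical, and how the remaining codes (TIMEOUT, NETWORK_ERROR, AUTHENTICATION_FAILED, NOT_FOUND) get assigned is not specified in the source, so it is left out. parseRetryAfterMs is an illustrative helper for the Retry-After header mentioned in the diagram.

```typescript
type ErrorCode =
  | "RATE_LIMITED" | "UPSTREAM_5XX" | "TIMEOUT" | "NETWORK_ERROR"
  | "HANDLER_ERROR" | "PAYLOAD_INVALID" | "AUTHENTICATION_FAILED" | "NOT_FOUND";

const RETRYABLE_CODES: ErrorCode[] = [
  "RATE_LIMITED", "UPSTREAM_5XX", "TIMEOUT", "NETWORK_ERROR",
];

function isRetryable(code: ErrorCode): boolean {
  return RETRYABLE_CODES.indexOf(code) !== -1;
}

// HTTP status mapping as drawn in the lifecycle diagram.
function classifyHttpStatus(status: number): ErrorCode {
  if (status === 429) return "RATE_LIMITED";
  if (status >= 500 && status <= 599) return "UPSTREAM_5XX";
  return "HANDLER_ERROR";
}

// Retry-After is either delta-seconds or an HTTP-date.
// Returns the wait in milliseconds, or null if the value is unusable.
function parseRetryAfterMs(value: string, now: Date): number | null {
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return parseInt(trimmed, 10) * 1000;
  const at = Date.parse(trimmed);
  return isNaN(at) ? null : Math.max(0, at - now.getTime());
}
```

A runner could call isRetryable() before consulting the attempt counter, so that non-retryable failures go to dead_letter without using up the remaining attempts.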

Monitoring & Observability

Metrics

• Queue depth by job type
• Processing latency (P50, P95, P99)
• Success/failure rates
• Retry attempt distributions

Alerting

• DLQ size thresholds
• High retry rates
• Stuck job detection
• Processing latency spikes

Logs

• Structured JSON logs
• Job lifecycle events
• Error stack traces
• Performance timing

Debugging

• Job inspection UI
• Manual retry controls
• Payload preview
• Execution history

API Endpoints

POST /api/jobs/enqueue
{
  "type": "webhook_delivery",
  "payload": { "url": "...", "data": {...} },
  "idempotencyKey": "unique-key",
  "runAt": "2024-01-15T10:30:00Z",  // optional
  "priority": "normal"  // low | normal | high
}

POST /api/jobs/run
// Cron endpoint - processes due jobs

GET /api/jobs/:id
// Job status and metadata

POST /api/jobs/:id/retry
{
  "mode": "now"  // now | later | reset
}
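For callers, the request bodies above could be typed and validated roughly as follows. The type and function names are hypothetical, and treating priority as optional is an assumption: the source marks only runAt as optional.

```typescript
type Priority = "low" | "normal" | "high";
type RetryMode = "now" | "later" | "reset";

interface EnqueueRequest {
  type: string;
  payload: Record<string, unknown>;
  idempotencyKey: string;
  runAt?: string;      // ISO 8601 timestamp
  priority?: Priority; // assumed optional
}

function isPriority(v: unknown): v is Priority {
  return v === "low" || v === "normal" || v === "high";
}

function isRetryMode(v: unknown): v is RetryMode {
  return v === "now" || v === "later" || v === "reset";
}

// Validates an untrusted JSON body and returns a typed request, or throws.
function parseEnqueueRequest(body: unknown): EnqueueRequest {
  if (typeof body !== "object" || body === null) throw new Error("body must be a JSON object");
  const b = body as Record<string, unknown>;
  const type = b.type;
  const payload = b.payload;
  const idempotencyKey = b.idempotencyKey;
  const runAt = b.runAt;
  const priority = b.priority;
  if (typeof type !== "string" || type === "") throw new Error("type is required");
  if (typeof payload !== "object" || payload === null) throw new Error("payload must be an object");
  if (typeof idempotencyKey !== "string" || idempotencyKey === "") throw new Error("idempotencyKey is required");
  const req: EnqueueRequest = { type, payload: payload as Record<string, unknown>, idempotencyKey };
  if (runAt !== undefined) {
    if (typeof runAt !== "string" || isNaN(Date.parse(runAt))) throw new Error("runAt must be an ISO timestamp");
    req.runAt = runAt;
  }
  if (priority !== undefined) {
    if (!isPriority(priority)) throw new Error("priority must be low, normal, or high");
    req.priority = priority;
  }
  return req;
}
```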
