Background Job Runner
DB-backed job queue built on PostgreSQL for reliable async processing
Jobs are stored in the jobs table and processed by runDueJobs(), invoked every minute via Vercel Cron at POST /api/jobs/run.
System Invariants
These are the rules the job system must always obey:
1. Idempotency is enforced
At both enqueue-time and handler-time:
- • Enqueue uses INSERT … ON CONFLICT DO NOTHING on idempotency key
- • Webhook handler checks existing successful deliveries before sending
2. Jobs never get "stuck"
- • Worker crashes mid-run → job reclaimable after locked_until expires (2 min TTL)
- • Reclaimed jobs tagged with JOB_LOCK_TIMEOUT_RECLAIMED for traceability
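The reclaim rule can be illustrated with a small predicate. This is a minimal sketch, not the runner's actual code: the `JobRow` shape, `isClaimable`, and `claim` are hypothetical names, and the real claim happens in SQL. The 2-minute TTL and the `JOB_LOCK_TIMEOUT_RECLAIMED` tag come from the rule above. Due-time filtering is ignored here.

```typescript
// Hypothetical row shape; the real jobs table columns may differ.
type JobStatus = "pending" | "running" | "success" | "dead_letter";

interface JobRow {
  id: string;
  status: JobStatus;
  lockedUntil: Date | null;
}

const LOCK_TTL_MS = 2 * 60 * 1000; // 2 min TTL, as stated above

// A job is claimable if it is pending, or running with an expired lock.
function isClaimable(job: JobRow, now: Date): boolean {
  if (job.status === "pending") return true;
  return (
    job.status === "running" &&
    job.lockedUntil !== null &&
    job.lockedUntil.getTime() <= now.getTime()
  );
}

// Returns the claimed row, plus a tag when the claim is a TTL reclaim.
function claim(job: JobRow, now: Date): { job: JobRow; reclaimTag: string | null } {
  if (!isClaimable(job, now)) throw new Error(`job ${job.id} is not claimable`);
  const reclaimTag = job.status === "running" ? "JOB_LOCK_TIMEOUT_RECLAIMED" : null;
  return {
    job: { ...job, status: "running", lockedUntil: new Date(now.getTime() + LOCK_TTL_MS) },
    reclaimTag,
  };
}
```

A job whose worker crashed keeps status running, so it only becomes claimable again once its lock has expired.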
3. Retries are deterministic
- • Backoff math is stable (exponential with ±10% jitter)
- • next_run_at stored and queryable for every pending retry
- • attempt >= maxAttempts → dead_letter
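One way to read this rule is as a pure transition function. The sketch below is illustrative only: `FailureOutcome` and `onFailure` are hypothetical names, and it assumes `attempt` counts the attempts made so far, including the one that just failed.

```typescript
// Outcome of a failed attempt: either a scheduled retry or the dead-letter queue.
type FailureOutcome =
  | { status: "pending"; nextRunAt: Date }
  | { status: "dead_letter" };

function onFailure(
  attempt: number,
  maxAttempts: number,
  now: Date,
  delayMs: number, // backoff delay, computed elsewhere
): FailureOutcome {
  // attempt >= maxAttempts → dead_letter
  if (attempt >= maxAttempts) return { status: "dead_letter" };
  // Otherwise the job returns to pending with a stored next_run_at.
  return { status: "pending", nextRunAt: new Date(now.getTime() + delayMs) };
}
```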
4. Observability is not optional
- • Every attempt records timing, duration, error codes
- • Structured logs on enqueue, claim, success, failure, reclaim
- • DLQ searchable by org + type + error code + time window
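As an illustration of the DLQ search dimensions, here is a minimal in-memory filter. It is a sketch under stated assumptions: `DeadLetterEntry`, `DlqFilter`, and `searchDlq` are hypothetical names, and a real implementation would presumably run this as a SQL query against the jobs table.

```typescript
// Hypothetical DLQ record; field names are assumptions for illustration.
interface DeadLetterEntry {
  jobId: string;
  orgId: string;
  type: string;
  errorCode: string;
  failedAt: Date;
}

// Every field is optional; omitted fields match everything.
interface DlqFilter {
  orgId?: string;
  type?: string;
  errorCode?: string;
  from?: Date; // inclusive
  to?: Date; // exclusive
}

function searchDlq(entries: readonly DeadLetterEntry[], f: DlqFilter): DeadLetterEntry[] {
  return entries.filter(
    (e) =>
      (f.orgId === undefined || e.orgId === f.orgId) &&
      (f.type === undefined || e.type === f.type) &&
      (f.errorCode === undefined || e.errorCode === f.errorCode) &&
      (f.from === undefined || e.failedAt.getTime() >= f.from.getTime()) &&
      (f.to === undefined || e.failedAt.getTime() < f.to.getTime()),
  );
}
```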
5. Payloads are validated
- • Max 128 KB serialized size, 10 levels of nesting, 500 total keys
- • Zod schema validation per job type (can be skipped via skipValidation: true)
Acceptance criteria: You can answer "What failed? why? how often? is it stuck? who owns it?" in under 30 seconds.
Status Lifecycle
enqueue() (atomic idempotency via ON CONFLICT DO NOTHING)
│
▼
pending ──────────────────────────────────────────────────────────────────┐
│ │
│ optimistic claim (UPDATE WHERE status='pending' │
│ OR (status='running' AND locked_until <= now)) │
▼ │
running ── handler throws, attempt < maxAttempts ──► pending (backoff) │
│ │ │
│ ├── 429 → RATE_LIMITED (Retry-After) │
│ ├── 5xx → UPSTREAM_5XX │
│ └── other → HANDLER_ERROR │
│ │
│ handler throws, attempt >= maxAttempts │
├──────────────────────────────────────────────────────────────────────► dead_letter
│ │
│ payload invalid (Zod) at runner time │
├──────────────────────────────────────────────────────────────────────► dead_letter
│ │
│ no handler registered │
├──────────────────────────────────────────────────────────────────────► dead_letter
│ │
│ handler succeeds │
▼ │
success │
│
POST /api/jobs/:id/retry ◄──────────────────────────────────────────────────┘
│ mode: now | later | reset
Terminal States
success, dead_letter
Retriable States
pending, running (via TTL reclaim)
Enqueue Safety
Atomic Idempotency
INSERT INTO jobs (...) VALUES (...) ON CONFLICT (idempotency_key) DO NOTHING RETURNING id
If 0 rows returned → conflict → SELECT existing job ID. Two concurrent enqueue() calls with the same idempotencyKey return the same job ID with zero duplicates.
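The insert-then-select flow described above can be sketched as follows. This is not the actual enqueue() implementation: `QueryFn` is a hypothetical minimal query interface that a real PostgreSQL client would be adapted to, `enqueueIdempotent` is an illustrative name, and the `type` and `payload` columns are assumptions about the jobs table.

```typescript
// Hypothetical query interface: runs parameterized SQL and returns rows with an id.
type QueryFn = (
  sql: string,
  params: unknown[],
) => Promise<{ rows: Array<{ id: string }> }>;

async function enqueueIdempotent(
  query: QueryFn,
  type: string,
  payload: unknown,
  idempotencyKey: string,
): Promise<{ id: string; created: boolean }> {
  const inserted = await query(
    `INSERT INTO jobs (type, payload, idempotency_key)
     VALUES ($1, $2, $3)
     ON CONFLICT (idempotency_key) DO NOTHING
     RETURNING id`,
    [type, JSON.stringify(payload), idempotencyKey],
  );
  if (inserted.rows.length > 0) {
    return { id: inserted.rows[0].id, created: true };
  }

  // Zero rows returned means the key already exists: look up the existing job.
  const existing = await query(
    "SELECT id FROM jobs WHERE idempotency_key = $1",
    [idempotencyKey],
  );
  if (existing.rows.length === 0) {
    throw new Error("conflict reported but no existing job found");
  }
  return { id: existing.rows[0].id, created: false };
}
```

Because the uniqueness check happens inside the database, two callers racing on the same key both end up with the same job ID, and only one of them sees `created: true`.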
Payload Validation
Before insertion, enqueue() validates:
Size: serialized JSON ≤ 128 KB
Depth: max 10 levels of nesting
Keys: max 500 total keys
Schema: Zod validation per job type (skippable via skipValidation: true)
Violations throw EnqueueError with code PAYLOAD_TOO_LARGE or PAYLOAD_INVALID.
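The structural limits (size, depth, keys) could be checked along these lines. This is a sketch, not the real validator: `validatePayloadLimits` is a hypothetical name, the `EnqueueError` class is re-declared here only so the example is self-contained, and mapping depth and key violations to `PAYLOAD_INVALID` is an assumption, as is counting only object keys (not array elements) toward the 500-key limit. Zod schema validation is out of scope here.

```typescript
const MAX_BYTES = 128 * 1024;
const MAX_DEPTH = 10;
const MAX_KEYS = 500;

// Stand-in for the real EnqueueError, declared here for self-containment.
class EnqueueError extends Error {
  constructor(
    public readonly code: "PAYLOAD_TOO_LARGE" | "PAYLOAD_INVALID",
    message: string,
  ) {
    super(message);
    this.name = "EnqueueError";
  }
}

function validatePayloadLimits(payload: unknown): void {
  let serialized: string | undefined;
  try {
    serialized = JSON.stringify(payload);
  } catch {
    // Cyclic or otherwise unserializable values end up here.
    throw new EnqueueError("PAYLOAD_INVALID", "payload is not JSON-serializable");
  }
  if (serialized === undefined) {
    throw new EnqueueError("PAYLOAD_INVALID", "payload is not JSON-serializable");
  }
  // Measure UTF-8 bytes rather than string length.
  const bytes = new TextEncoder().encode(serialized).length;
  if (bytes > MAX_BYTES) {
    throw new EnqueueError("PAYLOAD_TOO_LARGE", `payload is ${bytes} bytes, max ${MAX_BYTES}`);
  }

  let keys = 0;
  // Depth-first walk that stops as soon as a limit is exceeded.
  const walk = (value: unknown, depth: number): void => {
    if (value === null || typeof value !== "object") return;
    if (depth > MAX_DEPTH) {
      throw new EnqueueError("PAYLOAD_INVALID", `nesting exceeds ${MAX_DEPTH} levels`);
    }
    if (!Array.isArray(value)) keys += Object.keys(value).length;
    if (keys > MAX_KEYS) {
      throw new EnqueueError("PAYLOAD_INVALID", `payload exceeds ${MAX_KEYS} keys`);
    }
    for (const child of Object.values(value)) walk(child, depth + 1);
  };
  walk(payload, 1);
}
```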
Common Job Types
Webhook Delivery
- • HTTP POST to external endpoints
- • Retry on 429/5xx with exponential backoff
- • Signature verification for security
- • Delivery tracking and receipts
Report Generation
- • PDF/CSV export creation
- • Large dataset processing
- • Template rendering
- • Email delivery
Data Cleanup
- • Retention policy enforcement
- • Cache invalidation
- • Archive old records
- • GDPR compliance tasks
Notifications
- • Email notifications
- • Slack/Discord integrations
- • SMS alerts
- • Push notifications
Error Handling & Retries
Retry Strategy
// Exponential backoff with ±10% jitter
delay = baseDelay * (2 ** attempt) * (0.9 + Math.random() * 0.2)

// Max attempts by job type
webhook: 5 attempts (1s, 2s, 4s, 8s, 16s)
report: 3 attempts (30s, 60s, 120s)
cleanup: 2 attempts (5m, 15m)
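The backoff formula can be written as a small function with an injectable random source, which makes the jitter testable. This is a sketch: `backoffDelayMs` and `RandomFn` are hypothetical names, and it assumes `attempt` is zero-based with a 1 s base delay for webhooks, which matches the listed webhook schedule but is not stated explicitly.

```typescript
// Returns a value in [0, 1), like Math.random.
type RandomFn = () => number;

function backoffDelayMs(
  baseDelayMs: number,
  attempt: number, // assumed zero-based
  random: RandomFn = Math.random,
): number {
  const jitter = 0.9 + random() * 0.2; // factor in [0.9, 1.1)
  return Math.round(baseDelayMs * 2 ** attempt * jitter);
}

// With the jitter pinned to its midpoint, the nominal webhook schedule appears.
const nominal = [0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(1000, attempt, () => 0.5));
console.log(nominal); // [ 1000, 2000, 4000, 8000, 16000 ]
```

Passing the random source as a parameter keeps production behavior unchanged (it defaults to Math.random) while letting tests assert exact delays.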
Error Classification
Retryable Errors
- • RATE_LIMITED - HTTP 429
- • UPSTREAM_5XX - Server errors
- • TIMEOUT - Network timeouts
- • NETWORK_ERROR - Connection failures
Non-Retryable Errors
- • HANDLER_ERROR - Logic errors
- • PAYLOAD_INVALID - Validation failures
- • AUTHENTICATION_FAILED - Auth errors
- • NOT_FOUND - Missing resources
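The two lists above translate naturally into a lookup. The sketch below is illustrative: `ErrorCode`, `isRetryable`, and `classifyHttpStatus` are hypothetical names. The HTTP mapping follows the status lifecycle diagram (429, 5xx, everything else); no mapping for the other codes is assumed.

```typescript
type ErrorCode =
  | "RATE_LIMITED"
  | "UPSTREAM_5XX"
  | "TIMEOUT"
  | "NETWORK_ERROR"
  | "HANDLER_ERROR"
  | "PAYLOAD_INVALID"
  | "AUTHENTICATION_FAILED"
  | "NOT_FOUND";

// Codes from the "Retryable Errors" list; everything else is non-retryable.
const RETRYABLE: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  "RATE_LIMITED",
  "UPSTREAM_5XX",
  "TIMEOUT",
  "NETWORK_ERROR",
]);

function isRetryable(code: ErrorCode): boolean {
  return RETRYABLE.has(code);
}

// Maps an HTTP response status to an error code, per the lifecycle diagram.
function classifyHttpStatus(status: number): ErrorCode {
  if (status === 429) return "RATE_LIMITED";
  if (status >= 500 && status <= 599) return "UPSTREAM_5XX";
  return "HANDLER_ERROR";
}
```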
Monitoring & Observability
Metrics
- • Queue depth by job type
- • Processing latency (P50, P95, P99)
- • Success/failure rates
- • Retry attempt distributions
Alerting
- • DLQ size thresholds
- • High retry rates
- • Stuck job detection
- • Processing latency spikes
Logs
- • Structured JSON logs
- • Job lifecycle events
- • Error stack traces
- • Performance timing
Debugging
- • Job inspection UI
- • Manual retry controls
- • Payload preview
- • Execution history
API Endpoints
POST /api/jobs/enqueue
{
"type": "webhook_delivery",
"payload": { "url": "...", "data": {...} },
"idempotencyKey": "unique-key",
"runAt": "2024-01-15T10:30:00Z", // optional
"priority": "normal" // low | normal | high
}
POST /api/jobs/run
// Cron endpoint - processes due jobs
GET /api/jobs/:id
// Job status and metadata
POST /api/jobs/:id/retry
{
"mode": "now" | "later" | "reset"
}
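A caller of the retry endpoint might build its request like this. It is a sketch only: `RetryMode`, `isRetryMode`, and `buildRetryRequest` are hypothetical helper names, and any authentication headers the API requires are not shown because the source does not describe them.

```typescript
type RetryMode = "now" | "later" | "reset";

const RETRY_MODES: readonly string[] = ["now", "later", "reset"];

// Runtime guard for untrusted input, e.g. a value read from a form or CLI flag.
function isRetryMode(value: unknown): value is RetryMode {
  return typeof value === "string" && RETRY_MODES.includes(value);
}

interface RetryRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildRetryRequest(jobId: string, mode: RetryMode): RetryRequest {
  return {
    url: `/api/jobs/${encodeURIComponent(jobId)}/retry`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ mode }),
    },
  };
}
```

The resulting `url` and `init` can be passed to `fetch(url, init)` after prefixing the deployment's base URL.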