Every data engineer has lived through it: a pipeline fires on schedule, the upstream data isn’t there yet, and you find out three hours later when the dashboard is empty and someone in Slack is asking why the numbers look wrong. The pipeline didn’t fail in the traditional sense — the orchestrator ran successfully, the job completed without errors. It just operated on incomplete or stale data and produced garbage.
The root cause isn’t a bug. It’s a missing safety constraint. There’s no gate between “it’s time to run” and “the inputs are actually ready.”
This idea — that accidents come from inadequate control constraints rather than simple component failures — is the core thesis of STAMP (Systems-Theoretic Accident Model and Processes), developed by Nancy Leveson at MIT. STAMP was designed for aerospace and nuclear systems, but the model applies directly to data pipelines. A batch job that runs before its upstream dependency completes isn’t a failed component. It’s a system without adequate feedback.
I built Interlock to be that feedback mechanism.
What Interlock Does
Interlock is a readiness framework. It sits between your scheduler and your pipeline execution, evaluating a set of traits before allowing a trigger. If the traits don’t pass, the pipeline doesn’t fire.
The system works at three levels:
- Trait evaluation — individual checks like “source data updated within 5 minutes” or “upstream pipeline completed”
- Readiness rule — aggregation logic (e.g., all required traits must pass)
- SLA enforcement — alerts if evaluation or completion deadlines are breached
Evaluator Protocol
Evaluators are external programs that receive JSON on stdin and return JSON on stdout. They can be written in any language — shell scripts, Python, Go, whatever your team already uses.
This keeps Interlock decoupled from your data stack. The framework doesn’t need to know how to query your data catalog or check S3 timestamps. Evaluators do.
Pipeline Configuration
A pipeline declares which archetype it follows, overrides trait-specific configuration, and specifies how to trigger.
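A sketch of what such a declaration could look like. The YAML shape and field names here are guesses for illustration; Interlock's actual schema may differ:

```yaml
# Hypothetical pipeline declaration (field names are illustrative)
pipeline: sales-ingestion
archetype: batch-ingestion      # inherit the archetype's traits
traits:
  source-freshness:
    config:
      max_age: 15m              # override the archetype default
trigger:
  type: airflow
  dagId: sales_ingestion_daily
```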
Archetypes
Archetypes define reusable trait templates. Instead of repeating the same trait definitions across 50 pipelines, you define the pattern once.
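A hypothetical archetype definition, with field names invented for illustration (only `optionalTraits` is named in the text below):

```yaml
# Hypothetical archetype definition (field names are illustrative)
archetype: batch-ingestion
requiredTraits:
  source-freshness:
    evaluator: ./evaluators/check_freshness.sh
    config:
      max_age: 5m
    timeout: 30s
  upstream-complete:
    evaluator: ./evaluators/check_upstream.sh
    timeout: 30s
optionalTraits:
  schema-contract:
    evaluator: ./evaluators/check_schema.sh
```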
Every pipeline referencing batch-ingestion inherits these traits and can override evaluator paths, config values, or timeouts. Archetypes also support optionalTraits for non-blocking checks like schema contracts or data quality sampling.
Run State Machine
Once a pipeline is triggered, Interlock tracks it through a state machine:
PENDING ──> TRIGGERING ──> RUNNING ──> COMPLETED
                │             │
                │             ├──> COMPLETED_MONITORING ──> COMPLETED
                │             │
                └──> FAILED   └──> FAILED
State transitions use compare-and-swap (CAS) — every RunState carries a version number, and updates only succeed if the version matches. This prevents concurrent evaluations from clobbering each other.
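The versioned CAS idea can be sketched with an in-memory stand-in. `RunState` and the store here are simplified; the real providers implement the same check atomically in Redis and DynamoDB:

```go
// Minimal in-memory sketch of versioned compare-and-swap. The types are
// simplified stand-ins, not Interlock's actual RunState.
package main

import (
	"errors"
	"sync"
)

type RunState struct {
	Status  string
	Version int
}

type casStore struct {
	mu    sync.Mutex
	state RunState
}

var ErrVersionConflict = errors.New("version conflict")

// compareAndSwap applies next only if expectedVersion matches the stored
// version, then bumps the version. This mirrors the semantics of the Redis
// Lua script and the DynamoDB ConditionExpression.
func (s *casStore) compareAndSwap(expectedVersion int, next RunState) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state.Version != expectedVersion {
		return ErrVersionConflict
	}
	next.Version = expectedVersion + 1
	s.state = next
	return nil
}
```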
Architecture: Local Mode
The local deployment uses Redis for state and Postgres for long-term archival. Everything runs in Docker Compose:
┌─────────────────────────────────────────┐
│          Docker Compose Stack           │
│                                         │
│  Watcher ──poll──> Redis (state/CAS)    │
│     │                                   │
│     ├──eval──> Evaluator scripts        │
│     └──trigger──> HTTP / Command        │
│                                         │
│  Archiver ──30s──> Postgres (archival)  │
│  Grafana :3001 ──> Interlock API :3000  │
└─────────────────────────────────────────┘
The Watcher Loop
The watcher polls every 15 seconds (configurable). On each tick, for each pipeline and schedule:
- Acquire lock — eval:{pipeline}:{scheduleID} via Redis SetNX
- Check run log — skip if already completed or in backoff
- Evaluate readiness — invoke evaluator programs, collect results
- Check SLA — alert if evaluation deadline breached
- Trigger — if all required traits pass, execute the trigger
- Track execution — poll for completion or mark done
- Release lock
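The loop above can be condensed into a sketch. The store, evaluator, and trigger types here are simplified stand-ins for Interlock's actual interfaces, and the SLA check (step 4) is omitted:

```go
// A condensed sketch of one watcher tick for a single pipeline/schedule pair.
// Types are illustrative stand-ins, not Interlock's real interfaces.
package main

import "fmt"

type store struct {
	locks map[string]bool
	done  map[string]bool
}

// setNX mimics Redis SetNX: acquire the key only if it is not already held.
func (s *store) setNX(key string) bool {
	if s.locks[key] {
		return false
	}
	s.locks[key] = true
	return true
}

func (s *store) release(key string) { delete(s.locks, key) }

type traitResult struct {
	name string
	pass bool
}

func tick(s *store, pipeline, scheduleID string,
	evaluate func() []traitResult, trigger func() error) (string, error) {

	key := fmt.Sprintf("eval:%s:%s", pipeline, scheduleID)
	if !s.setNX(key) { // 1. acquire lock
		return "locked", nil
	}
	defer s.release(key) // 7. release lock

	if s.done[key] { // 2. check run log
		return "already-done", nil
	}
	for _, r := range evaluate() { // 3. evaluate readiness
		if !r.pass {
			return "not-ready: " + r.name, nil
		}
	}
	if err := trigger(); err != nil { // 5. trigger
		return "failed", err
	}
	s.done[key] = true // 6. track execution (simplified to "mark done")
	return "triggered", nil
}
```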
CAS operations in Redis use a Lua script for atomicity — read the current version, compare, set the new value, all in a single atomic operation. The archiver runs on a separate 30-second loop, copying completed runs and events from Redis into Postgres.
Provider Interface
Storage is abstracted behind a Provider interface with 40+ methods across 10 sub-interfaces: PipelineStore, TraitStore, RunStore, EventStore, RunLogStore, RerunStore, CascadeStore, LateArrivalStore, ReplayStore, and Locker. Both Redis and DynamoDB implement the full interface. A shared providertest package runs 15 contract tests against both backends, including a CAS race condition test that spins up 10 goroutines contending on the same run state.
Architecture: AWS Mode
The AWS variant replaces the polling watcher with an event-driven architecture. DynamoDB Streams notify downstream Lambdas when state changes, and a Step Function orchestrates the evaluation lifecycle.
DynamoDB ──stream──> stream-router ──StartExecution──> Step Function (47 states)
    ^                                                     │
    │ CAS writes        ┌────────────────┬────────────────┤
    │                   │                │                │
    │             orchestrator       evaluator  trigger / run-checker
    │                Lambda           Lambda           Lambda
    └───────────────────┘                │
                                         │ HTTP POST
                                         v
                                   Evaluator API
DynamoDB Single-Table Design
Everything lives in one table with PK/SK keys and a GSI1 for cross-pipeline queries:
| Entity | PK | SK |
|---|---|---|
| Pipeline config | PIPELINE#{id} | CONFIG |
| Run state | RUN#{runID} | STATE |
| Run log | PIPELINE#{id} | RUNLOG#{date}#{scheduleID} |
| Trait result | PIPELINE#{id} | TRAIT#{type} |
| Lock | LOCK#{key} | LOCK |
| Cascade marker | PIPELINE#{id} | MARKER#{date}#{scheduleID} |
Runs use a dual-write pattern: a truth item at RUN#{id}/STATE for direct lookup and CAS, plus a list copy at PIPELINE#{id}/RUN#{id} for pipeline-scoped queries. Locks use conditional PutItem with attribute_not_exists(PK) OR #ttl < :now — same acquire-or-expire semantics as Redis SetNX with TTL.
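The acquire-or-expire condition reads as a pure predicate. This sketch mirrors attribute_not_exists(PK) OR #ttl < :now with an illustrative struct:

```go
// Pure-Go sketch of the lock's acquire-or-expire condition. The struct is
// illustrative; the real check runs as a DynamoDB ConditionExpression.
package main

type lockItem struct {
	owner string
	ttl   int64 // unix seconds when the lock expires
}

// acquirable reports whether a new lock write would succeed: either no item
// exists (attribute_not_exists), or the existing item has already expired.
func acquirable(existing *lockItem, now int64) bool {
	return existing == nil || existing.ttl < now
}
```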
5 Lambda Handlers
| Lambda | Role |
|---|---|
| stream-router | DynamoDB Stream -> Step Function, dedup by {pipeline}:{date}:{scheduleID} |
| orchestrator | Multi-action dispatcher: exclusion, locking, run log, readiness, SLA, drift |
| evaluator | Single trait evaluation via HTTP POST to evaluator API |
| trigger | CAS state transitions (PENDING -> TRIGGERING -> RUNNING), execute trigger |
| run-checker | Poll external system for job completion status |
All handlers use lazy dependency initialization via sync.Once and accept a Deps struct for testability.
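The pattern looks roughly like this; the fields of Deps and the constructor are assumptions for illustration:

```go
// Sketch of lazy dependency initialization for a Lambda handler.
// The Deps fields are illustrative, not Interlock's actual struct.
package main

import "sync"

type Deps struct {
	tableName string
	// ...clients, stores, etc.
}

var (
	deps     *Deps
	depsOnce sync.Once
)

// getDeps builds dependencies exactly once per Lambda container, so
// cold-start work isn't repeated on every invocation. Tests bypass this
// and pass a Deps struct directly.
func getDeps() *Deps {
	depsOnce.Do(func() {
		deps = &Deps{tableName: "interlock"} // e.g. read from env in real code
	})
	return deps
}
```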
47-State Step Function
The ASL definition orchestrates the full lifecycle in phases:
- Gate checks: InitDefaults -> CheckExclusion -> AcquireLock -> CheckRunLog
- Evaluation: ResolvePipeline -> EvaluateTraits (Map/parallel) -> CheckEvaluationSLA -> CheckReadiness
- Execution: TriggerPipeline -> WaitForRun (30s loop) -> PollRunStatus (max 120 iterations)
- Completion: CheckCompletionSLA -> LogCompleted -> NotifyDownstream -> ReleaseLock
- Post-completion monitoring: Re-evaluate traits every 60s, detect drift, handle late arrivals
- Retry: On failure, backoff wait then re-acquire lock and loop
Every phase branches through Choice states with clean exit paths — if the pipeline is excluded, the lock can’t be acquired, or the run already completed today, the execution ends early without wasting Lambda invocations. Failed triggers check the retry policy and either loop back with exponential backoff or release the lock and exit.
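As a sketch of that Choice-state pattern (state and field names here are illustrative, not the actual ASL definition):

```json
"CheckExclusion": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.excluded", "BooleanEquals": true, "Next": "ExitExcluded" }
  ],
  "Default": "AcquireLock"
},
"ExitExcluded": { "Type": "Succeed" }
```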
Key Design Decisions
Distributed Locking and CAS
Lock keys follow the format eval:{pipeline}:{scheduleID}. Two schedules for the same pipeline evaluate concurrently; two evaluations for the same schedule are serialized. Lock TTL is calculated dynamically based on trait count and timeout. State transitions use compare-and-swap everywhere — Redis via Lua scripts, DynamoDB via ConditionExpression on a version field. Same semantics, different mechanisms.
Multi-Schedule Support
Most frameworks assume one run per pipeline per day. Interlock supports arbitrary schedules.
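A hypothetical multi-schedule declaration (field names are illustrative), using the hourly windows from the example deployment described later:

```yaml
# Hypothetical schedule block (field names are illustrative)
pipeline: gharchive-silver
schedules:
  - id: h00
    window: "00:00-01:00"
  - id: h01
    window: "01:00-02:00"
  # ... one window per hour, through h23
```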
Run logs are keyed by {pipeline}:{date}:{scheduleID}, so each window is tracked independently. Pipelines without explicit schedules get an implicit daily schedule.
Calendar Exclusions
Interlock supports named calendar files and inline exclusion rules (specific dates and weekdays). Exclusion is checked before acquiring a lock — excluded days are completely dormant with zero resource consumption.
8 Trigger Types
| Type | Backend | Status Check |
|---|---|---|
| command | Shell subprocess | Exit code |
| http | HTTP webhook | Response status |
| airflow | Airflow REST API | DAG run state |
| glue | AWS Glue SDK | Job run status |
| emr | EMR SDK | Step status |
| emr-serverless | EMR Serverless SDK | Job run status |
| step-function | Step Functions SDK | Execution status |
| databricks | REST API 2.1 | Job run status |
Each AWS trigger type defines a narrow SDK interface (2 methods) for testability. Lazy client initialization means unused SDKs are never loaded.
Real-World Example: Medallion Pipeline
I built interlock-aws-example as a working reference deployment. It implements a medallion architecture (bronze -> silver -> gold) with real data sources:
- GH Archive — GitHub event data, hourly HTTP bulk download
- Open-Meteo — Weather forecast data, hourly REST API
This produces 4 pipelines (2 silver, 2 gold), each with 24 hourly schedules (h00 through h23) — 96 schedule windows total.
Cascade Triggering
Silver completion automatically triggers gold evaluation. When gharchive-silver completes for schedule h14, Interlock writes a cascade marker that fires the gharchive-gold Step Function execution. No cron delay, no polling — the downstream pipeline evaluates as soon as its upstream dependency is met.
Infrastructure
The example deploys to AWS with Terraform: 1 DynamoDB table, 1 Step Function, 5 Go Lambda handlers, 3 Python evaluator Lambdas, 1 alert-logger Lambda, 2 EventBridge rules, 4 Glue jobs, 1 API Gateway, 1 SNS topic, and 1 S3 bucket. The entire stack is defined in flat Terraform with for_each on the core Lambda/Glue/EventBridge resources.
Interlock is published as a Go module — the example repo imports it directly with go get github.com/dwsmith1983/interlock@v0.1.0.
What’s Next
Cloud providers. The current implementation covers local (Redis + Postgres) and AWS (DynamoDB + Lambda + Step Functions). GCP (Firestore + Cloud Run + Workflows) and Azure (Cosmos DB + Functions + Durable Functions) are planned.
More complex testing. The medallion pipeline is a proof of concept with straightforward data sources. I need production workloads — more pipelines, more complex dependency graphs, higher volumes — to flush out edge cases in the evaluation loop and state machine.
Open table format export. Currently run history lives in DynamoDB or Postgres. I want to export to Parquet, Iceberg, and Delta so teams can query pipeline metadata alongside their data lake. The Go ecosystem has good library support: apache/arrow-go for Parquet, apache/iceberg-go for Iceberg.
Documentation. A docs site is live at GitHub Pages, built from the repo. It covers setup, configuration, and the provider interface, but needs expansion as the API surface grows.