v0.6.0 (March 2026): Failure classification, per-source retry budgets, bounded job polling, Glue false-success detection. See What’s New in v0.6.
v0.5.0 (February 2026): Centralized observability, Slack Bot API threading, proactive SLA monitoring.
Every data engineer has lived through it: a pipeline fires on schedule, the upstream data isn’t there yet, and you find out three hours later when the dashboard is empty and someone in Slack is asking why the numbers look wrong. The pipeline didn’t fail in the traditional sense — the orchestrator ran successfully, the job completed without errors. It just operated on incomplete or stale data and produced garbage.
The root cause isn’t a bug. It’s a missing safety constraint. There’s no gate between “it’s time to run” and “the inputs are actually ready.”
This idea — that accidents come from inadequate control constraints rather than simple component failures — is the core thesis of STAMP (Systems-Theoretic Accident Model and Processes), developed by Nancy Leveson at MIT. STAMP was designed for aerospace and nuclear systems, but the model applies directly to data pipelines. A batch job that runs before its upstream dependency completes isn’t a failed component. It’s a system without adequate feedback.
I built Interlock to be that feedback mechanism.
What Interlock Does
Interlock is a readiness framework. It sits between your data sources and your pipeline execution, evaluating sensor data against declarative rules before allowing a trigger. If the rules don’t pass, the pipeline doesn’t fire.
The system works at three levels:
- Sensor data — external processes write observations to DynamoDB (timestamps, row counts, status flags). The framework reads but never writes sensor data.
- Declarative rules — 8 check types (`exists`, `equals`, `gt`, `gte`, `lt`, `lte`, `age_lt`, `age_gt`) evaluated against sensor fields. No custom evaluator code needed.
- SLA enforcement — proactive monitoring via EventBridge Scheduler entries that fire warnings and breaches at exact timestamps, even when pipelines never trigger.
Declarative Sensor System
Sensors are external processes that write observations directly to DynamoDB as SENSOR# records. A bronze ingestion job writes {"complete": true, "count": 4200} after loading data. A monitoring script writes {"pct_of_expected": 0.92} every hour. The framework doesn’t care how sensor data gets there — it just evaluates rules against whatever fields exist.
The v0.2.0 release used a stdin/stdout evaluator protocol where external programs received JSON and returned pass/fail results. The v0.4.0 release replaced this with a declarative rules DSL that eliminates custom evaluator code entirely:
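As a sketch of what a rule in this DSL might look like — the key names (`sensor`, `field`, `check`, `value`) are illustrative assumptions, not the confirmed schema:

```yaml
# Illustrative rules sketch — key names are assumptions, not the exact Interlock schema.
rules:
  - sensor: hourly-status      # sensor record to read
    field: complete            # field within the sensor observation
    check: equals              # one of the 8 check types
    value: true
  - sensor: hourly-status
    field: count
    check: gte                 # numeric comparison
    value: 1000
  - sensor: hourly-status
    field: updated_at
    check: age_lt              # temporal freshness check
    value: 45m
```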
Pipeline Configuration
A pipeline config defines the full lifecycle: when to evaluate, what to check, how to trigger, and what to monitor after completion.
- `pipeline` — identifier, owner, and description
- `schedule` — trigger condition (sensor key + check), evaluation window and interval, optional cron expression
- `sla` — deadline and expected duration for proactive monitoring
- `validation` — trigger mode (`ALL` or `ANY`) and an array of declarative rules
- `job` — trigger type, job-specific config, retry budget, and poll window
- `postRun` — post-completion rules for drift detection with their own evaluation interval and window
Here’s a sensor-triggered hourly pipeline with post-run monitoring:
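A sketch of what that config could look like, following the sections listed above; the exact key names and layout are assumptions, not the canonical Interlock schema:

```yaml
# Illustrative pipeline config — section structure follows the docs above,
# but key names are assumptions, not the canonical schema.
pipeline:
  name: earthquake-silver
  owner: data-platform
schedule:
  sensor: hourly-status            # sensor-triggered: no cron expression
  check: { field: complete, check: equals, value: true }
  evaluationWindowSeconds: 3600
  evaluationIntervalSeconds: 60
sla:
  deadline: 55m                    # relative deadline for proactive monitoring
  expectedDuration: 10m
validation:
  mode: ALL                        # all rules must pass before triggering
  rules:
    - { field: count, check: gte, value: 1000 }
job:
  type: glue
  maxRetries: 2
  jobPollWindowSeconds: 3600
postRun:
  windowSeconds: 7200              # re-evaluate for 2 hours after completion
  intervalSeconds: 600
  rules:
    - { field: count, check: gte, value: 1000 }
```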
No cron schedule — the pipeline fires when hourly-status.complete becomes true. The postRun block re-evaluates rules for 2 hours after job completion to detect data drift.
Run State Machine
Once a pipeline is triggered, Interlock tracks it through a state machine:
```
PENDING ──> TRIGGERING ──> RUNNING ──> COMPLETED
                │              │
                │              ├──> JOB_POLL_EXHAUSTED ──> FAILED_FINAL
                │              │
                └──> FAILED    └──> FAILED ──> FAILED_FINAL
```
State transitions use compare-and-swap (CAS) — every RunState carries a version number, and updates only succeed if the version matches. This prevents concurrent evaluations from clobbering each other. Terminal statuses COMPLETED and FAILED_FINAL are set by a dedicated CompleteTrigger step that also publishes lifecycle events.
Architecture
The architecture is fully event-driven on AWS. DynamoDB Streams notify downstream Lambdas when state changes, a Step Function orchestrates the evaluation lifecycle, and EventBridge routes all events to observability and alerting targets.
Local mode (Redis + Postgres + Docker Compose) was removed in v0.3.0 to focus on the event-driven AWS architecture. GCP and Azure backends are planned after AWS stabilizes.
```
Sensors ──write──> DynamoDB (control) ──stream──> stream-router
                                                        │
                            ┌───────────────────────────┤
                            │                           │
                      StartExecution               EventBridge
                            │                         events
                            v                           │
             Step Function (orchestrator)        ┌──────┴──────┐
                            │                    │             │
                            v               event-sink   alert-dispatcher
                     DynamoDB writes             │             │
                                           events table      Slack
```
DynamoDB — Multi-Table Design
Four tables with clear access patterns and independent scaling:
| Table | Purpose | Key Entities |
|---|---|---|
| `control` | Pipeline state, sensor data, trigger status, locks | `PIPELINE#`, `SENSOR#`, `TRIGGER#`, `LOCK#` |
| `joblog` | Run history and audit trail | `JOBLOG#` entries with status, timestamps, failure categories |
| `rerun` | Re-run requests and budget tracking | `RERUN_REQUEST#`, `RERUN_COUNT#` |
| `events` | Centralized event log (14 event types) | PK/SK with GSI1 on eventType + timestamp |
The control table has DynamoDB Streams enabled. When a sensor write or trigger status change hits the stream, the stream-router evaluates whether to start a Step Function execution. Locks use conditional PutItem with `attribute_not_exists(PK) OR #ttl < :now` — the same acquire-or-expire semantics across all lock types.
6 Lambda Handlers
| Lambda | Role |
|---|---|
| `stream-router` | DynamoDB Stream → Step Function (trigger conditions), routes events to EventBridge, validates rerun requests, detects late data arrival |
| `orchestrator` | Multi-mode dispatcher: evaluate rules, trigger jobs, check-job status, post-run validation, complete-trigger |
| `sla-monitor` | 5 modes: schedule, cancel, fire-alert, calculate, reconcile. Manages EventBridge Scheduler lifecycle for proactive SLA monitoring |
| `watchdog` | Stale trigger detection, missed cron schedules, sensor-triggered pipeline reconciliation, monitoring expiry |
| `event-sink` | Writes all 14 EventBridge event types to the events table for centralized observability |
| `alert-dispatcher` | SQS → Slack Bot API with message threading by pipeline-day; thread records group related alerts into Slack threads |
All handlers use lazy dependency initialization via sync.Once and accept a Deps struct for testability.
Step Function Lifecycle
The ASL definition orchestrates the full lifecycle as a sequential state machine with clean exit paths at every phase:
- Gate checks: InitDefaults → CheckExclusion → AcquireLock → CheckRunLog
- Sensor evaluation: EvaluateRules (declarative rule checks against sensor data) → WaitForReadiness (retry loop with configurable interval)
- Trigger + bounded job polling: TriggerPipeline → PollJobStatus (bounded by `jobPollWindowSeconds`, default 1h) → HandleJobPollExhausted on timeout
- Complete trigger: CompleteTrigger sets terminal status (`COMPLETED` or `FAILED_FINAL`), publishes lifecycle events, cancels SLA schedules
- PostRun monitoring: re-evaluates postRun rules at configured intervals, detects drift, handles late arrivals
- SLA checks: EventBridge Scheduler entries fire warnings and breaches at exact timestamps throughout the lifecycle
Every phase branches through Choice states with early exit — if the pipeline is excluded, the lock can’t be acquired, or the run already completed, the execution ends without wasting Lambda invocations. The Step Function has a global timeout (default 4h, configurable via sfn_timeout_seconds) to prevent unbounded execution.
Key Design Decisions
Distributed Locking and CAS
Lock keys follow the format eval:{pipeline}:{scheduleID}. Two schedules for the same pipeline evaluate concurrently; two evaluations for the same schedule are serialized. Lock TTL is calculated dynamically — ResolveTriggerLockTTL() reads the Step Function timeout and adds a 30-minute buffer. State transitions use compare-and-swap everywhere via DynamoDB ConditionExpression on a version field.
Multi-Schedule Support
Interlock supports arbitrary schedules at daily or hourly granularity. When sensors include both date and hour fields, the framework uses a composite execution date (e.g., 2026-03-03T10). Glue triggers automatically receive --par_day and --par_hour arguments.
Run logs are keyed by {pipeline}:{date}:{scheduleID}, so each window is tracked independently. Pipelines with cron schedules fire on their configured cadence; sensor-triggered pipelines fire as soon as the trigger condition is met.
Calendar Exclusions
Interlock supports named calendar files and inline exclusion rules (specific dates and weekdays). Exclusion is checked before acquiring a lock — excluded days are completely dormant with zero resource consumption.
8 Trigger Types
| Type | Backend | Status Check |
|---|---|---|
| `command` | Shell subprocess | Exit code |
| `http` | HTTP webhook | Response status |
| `airflow` | Airflow REST API | DAG run state |
| `glue` | AWS Glue SDK | Job run status + CloudWatch RCA log cross-check |
| `emr` | EMR SDK | Step status |
| `emr-serverless` | EMR Serverless SDK | Job run status |
| `step-function` | Step Functions SDK | Execution status |
| `databricks` | REST API 2.1 | Job run status |
Each AWS trigger type defines a narrow SDK interface (2 methods) for testability. The glue type cross-checks the CloudWatch RCA log stream when Glue reports SUCCEEDED — Glue can return success when the Spark driver catches a SparkException and exits cleanly. The RCA log captures the actual failure reason.
Real-World Example: Medallion Pipeline
I built interlock-aws-example as a working reference deployment. It implements a medallion architecture (bronze → silver → gold) with real data sources:
- USGS Earthquakes — seismic event data, ingested every 20 minutes
- CoinLore Crypto — cryptocurrency market data, ingested every 20 minutes
This produces 4 pipelines (earthquake-silver, earthquake-gold, crypto-silver, crypto-gold) with hourly ETL schedules. Bronze pipelines are sensor-triggered via hourly-status — no cron schedules. When the bronze ingestion writes sensor data, the silver pipeline evaluates immediately.
Cascade Triggering and PostRun Monitoring
Silver completion automatically triggers gold evaluation. When earthquake-silver completes for hour T10, Interlock writes a cascade marker that fires the earthquake-gold Step Function execution. No cron delay, no polling — the downstream pipeline evaluates as soon as its upstream dependency is met.
Silver pipelines use postRun rules to detect data drift after job completion. If new data arrives that changes the aggregation, the postRun evaluation detects the drift and triggers a cascade re-run with its own retry budget (maxDriftReruns).
Dashboard
The example includes a React + Vite dashboard with a timeline swimlane view. Each pipeline gets a row, each hour gets a cell, and clicking a cell drills down into the full event history for that execution. The dashboard reads directly from the DynamoDB events table.
Infrastructure
The example deploys to AWS with Terraform: 4 DynamoDB tables, 1 Step Function, 6 Go Lambda handlers, 4 Glue jobs (Delta Lake MERGE), EventBridge rules + schedulers, SQS alert queue with DLQ, and 1 S3 bucket. The entire stack is defined as a reusable Terraform module with for_each on the core Lambda/Glue/EventBridge resources.
Interlock is published as a Go module — the example repo imports it directly with `go get github.com/dwsmith1983/interlock@v0.6.0`.
What’s New in v0.6
| Version | Date | Highlights |
|---|---|---|
| v0.6.0 | 2026-03-07 | Failure classification, per-source retry budgets, bounded job polling, Glue false-success detection |
| v0.5.0 | 2026-03-04 | Centralized observability (event-sink + alert-dispatcher), Slack Bot API threading, proactive SLA monitoring |
| v0.4.0 | 2026-03-03 | Declarative sensors/rules, multi-table DynamoDB, sub-daily execution, EventBridge SLA monitoring |
| v0.3.0 | 2026-02-27 | Post-completion monitoring expiry, watchdog improvements |
Declarative Sensors and Rules
The v0.4.0 release replaced the archetype/trait/evaluator system with a declarative sensor and rules model. External processes write sensor data to DynamoDB; pipeline configs define rules as YAML. Eight check types cover the common validation patterns:
- Value checks: `exists`, `equals`, `gt`, `gte`, `lt`, `lte`
- Temporal checks: `age_lt` (field updated within duration), `age_gt` (field older than duration)
This decouples data collection from validation. A sensor can be a Lambda, a cron job, or an inline step in your ETL — anything that can write to DynamoDB. The rules engine evaluates whatever fields exist without custom code.
Centralized Observability
The v0.5.0 release added an event-sink Lambda that writes all 14 EventBridge event types to a dedicated events table with a GSI for querying by event type and timestamp. The alert-dispatcher Lambda reads from an SQS queue and delivers formatted Slack notifications using the Bot API with message threading.
Thread records (THREAD#{scheduleId}#{date}) stored in the events table group related alerts into Slack threads by pipeline, schedule, and date. First alert for a pipeline-day creates a new thread; subsequent alerts reply in-thread. This keeps Slack channels usable even with dozens of hourly pipelines.
14 event types: TRIGGER_STARTED, JOB_STARTED, JOB_COMPLETED, JOB_FAILED, JOB_POLL_EXHAUSTED, VALIDATION_EXHAUSTED, SLA_MET, SLA_WARNING, SLA_BREACH, SCHEDULE_MISSED, LATE_DATA_ARRIVAL, RERUN_REJECTED, RETRY_EXHAUSTED, TRIGGER_RECOVERED.
Failure Classification and Retry Budgets
The v0.6.0 release introduces failure classification. The StatusChecker interface propagates a FailureCategory (PERMANENT or TRANSIENT) through job status checks to the joblog.
This enables per-source retry budgets:
- `maxRetries` — retries for transient infrastructure failures (e.g., `ConcurrentRunsExceededException`)
- `maxCodeRetries` — retries for permanent code failures (default 1, so code bugs don’t burn the full retry budget)
- `maxDriftReruns` — budget for drift-triggered re-runs (postRun monitoring)
- `maxManualReruns` — budget for external rerun requests
Each budget is tracked independently via CountRerunsBySource, so drift reruns don’t consume the job-failure retry budget. All fields use *int pointer semantics — nil uses the default, 0 disables.
Bounded Job Polling
The jobPollWindowSeconds field (default: 3600s / 1h) caps how long the Step Function polls for job completion. When the window expires, the orchestrator writes a timeout joblog entry, sets the trigger to FAILED_FINAL, and publishes a JOB_POLL_EXHAUSTED event. This prevents unbounded check-job polling when external jobs hang indefinitely.
PostRun Monitoring
The postRun block enables drift detection after job completion. Rules are re-evaluated at a configured interval within a time window. If post-run rules fail (e.g., row count drops, freshness degrades), the system can trigger cascade re-runs for downstream pipelines.
This catches a class of problem that pre-run validation can’t: data that was valid when the job ran but changed afterward. Late-arriving records, upstream corrections, or partition overwrites all produce drift that postRun monitoring detects.
Safety Hardening
Several safety features were added across v0.5–v0.6:
- Glue false-success detection: cross-checks the CloudWatch RCA log stream when Glue reports `SUCCEEDED`. Catches cases where the Spark driver exits cleanly despite a `SparkException`.
- Step Function global timeout: configurable via the `sfn_timeout_seconds` Terraform variable (default 4h). Prevents unbounded execution if the orchestrator loop stalls.
- Pipeline config validation: `ValidatePipelineConfig` enforces bounds on all retry/rerun fields at config load time. Invalid configs are logged and skipped.
- SLA alert suppression: `handleSLAFireAlert` checks trigger status before publishing — suppresses warnings and breaches for pipelines that already completed or permanently failed.
- Watchdog forward-only alerting: skips cron schedules whose most recent expected fire time is before the Lambda’s cold start, preventing retroactive alerts after deploys.
What’s Next
Cloud providers. GCP (Firestore + Cloud Run + Workflows) and Azure (Cosmos DB + Functions + Durable Functions) are planned after the AWS implementation stabilizes. The core framework is backend-agnostic; the declarative rules engine and state machine are pure Go with no AWS dependencies.
Dashboard improvements. The React + Vite dashboard currently reads directly from DynamoDB. Planned additions include historical trend views, SLA compliance dashboards, and alerting configuration.
Links
- Interlock: github.com/dwsmith1983/interlock
- AWS Example: github.com/dwsmith1983/interlock-aws-example
- Go Module: pkg.go.dev/github.com/dwsmith1983/interlock
- Docs: dwsmith1983.github.io/interlock