Interlock: A STAMP-Based Safety Framework for Data Pipelines

How I built a STAMP-based safety framework in Go with declarative sensors, failure classification, and centralized observability for data pipeline reliability on AWS


v0.6.0 (March 2026): Failure classification, per-source retry budgets, bounded job polling, Glue false-success detection. See What’s New in v0.6.

v0.5.0 (February 2026): Centralized observability, Slack Bot API threading, proactive SLA monitoring.

Every data engineer has lived through it: a pipeline fires on schedule, the upstream data isn’t there yet, and you find out three hours later when the dashboard is empty and someone in Slack is asking why the numbers look wrong. The pipeline didn’t fail in the traditional sense — the orchestrator ran successfully, the job completed without errors. It just operated on incomplete or stale data and produced garbage.

The root cause isn’t a bug. It’s a missing safety constraint. There’s no gate between “it’s time to run” and “the inputs are actually ready.”

This idea — that accidents come from inadequate control constraints rather than simple component failures — is the core thesis of STAMP (Systems-Theoretic Accident Model and Processes), developed by Nancy Leveson at MIT. STAMP was designed for aerospace and nuclear systems, but the model applies directly to data pipelines. A batch job that runs before its upstream dependency completes isn’t a failed component. It’s a system without adequate feedback.

I built Interlock to be that feedback mechanism.

What Interlock Does

Interlock is a readiness framework. It sits between your data sources and your pipeline execution, evaluating sensor data against declarative rules before allowing a trigger. If the rules don’t pass, the pipeline doesn’t fire.

The system works at three levels:

  1. Sensor data — external processes write observations to DynamoDB (timestamps, row counts, status flags). The framework reads but never writes sensor data.
  2. Declarative rules — 8 check types (exists, equals, gt, gte, lt, lte, age_lt, age_gt) evaluated against sensor fields. No custom evaluator code needed.
  3. SLA enforcement — proactive monitoring via EventBridge Scheduler entries that fire warnings and breaches at exact timestamps, even when pipelines never trigger.

Declarative Sensor System

Sensors are external processes that write observations directly to DynamoDB as SENSOR# records. A bronze ingestion job writes {"complete": true, "count": 4200} after loading data. A monitoring script writes {"pct_of_expected": 0.92} every hour. The framework doesn’t care how sensor data gets there — it just evaluates rules against whatever fields exist.
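A minimal sketch of what such a record might look like in Go — the SensorRecord type, its field names, and the NewObservation helper are illustrative, not the framework's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// SensorRecord mirrors the general shape of a SENSOR# item in the control
// table. The framework only requires that rules reference fields that
// actually exist on the record.
type SensorRecord struct {
	PK        string         `json:"PK"`        // e.g. "SENSOR#hourly-status"
	Fields    map[string]any `json:"fields"`    // arbitrary observation fields
	UpdatedAt time.Time      `json:"updatedAt"` // used by age_lt / age_gt checks
}

// NewObservation builds the record a bronze ingestion job might write
// after loading data.
func NewObservation(key string, fields map[string]any) SensorRecord {
	return SensorRecord{
		PK:        "SENSOR#" + key,
		Fields:    fields,
		UpdatedAt: time.Now().UTC(),
	}
}

func main() {
	rec := NewObservation("upstream-complete", map[string]any{
		"complete": true,
		"count":    4200,
	})
	out, _ := json.Marshal(rec)
	fmt.Println(string(out))
}
```

Because the framework reads but never writes sensor data, anything that can issue a DynamoDB PutItem — a Lambda, a cron job, an inline ETL step — can act as a sensor.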

The v0.2.0 release used a stdin/stdout evaluator protocol where external programs received JSON and returned pass/fail results. The v0.4.0 release replaced this with a declarative rules DSL that eliminates custom evaluator code entirely:

pipeline:
  id: gold-revenue
  owner: analytics-team
  description: Gold-tier revenue aggregation pipeline

schedule:
  cron: "0 8 * * *"
  timezone: UTC
  trigger:
    key: upstream-complete
    check: equals
    field: status
    value: ready
  evaluation:
    window: 1h
    interval: 5m

sla:
  deadline: "10:00"
  expectedDuration: 30m

validation:
  trigger: "ALL"
  rules:
    - key: upstream-complete
      check: equals
      field: status
      value: ready
    - key: row-count
      check: gte
      field: count
      value: 1000
    - key: freshness
      check: age_lt
      field: updatedAt
      value: 2h

job:
  type: glue
  config:
    jobName: gold-revenue-etl
  maxRetries: 2

Pipeline Configuration

A pipeline config defines the full lifecycle: when to evaluate, what to check, how to trigger, and what to monitor after completion.

  • pipeline — identifier, owner, and description
  • schedule — trigger condition (sensor key + check), evaluation window and interval, optional cron expression
  • sla — deadline and expected duration for proactive monitoring
  • validation — trigger mode (ALL or ANY) and an array of declarative rules
  • job — trigger type, job-specific config, retry budget, and poll window
  • postRun — post-completion rules for drift detection with their own evaluation interval and window

Here’s a sensor-triggered hourly pipeline with post-run monitoring:

pipeline:
  id: silver-cdr-hour
  owner: data-platform
  description: Silver CDR hourly aggregation — triggered by bronze completion

schedule:
  trigger:
    key: hourly-status
    check: equals
    field: complete
    value: true
  evaluation:
    window: 15m
    interval: 2m

sla:
  deadline: ":30"
  expectedDuration: 10m

validation:
  trigger: "ALL"
  rules:
    - key: hourly-status
      check: gte
      field: pct_of_expected
      value: 0.85

job:
  type: glue
  config:
    jobName: "${environment}-cdr-agg-hour"
    arguments:
      "--s3_bucket": "${s3_bucket}"
  maxRetries: 2
  jobPollWindowSeconds: 3600

postRun:
  evaluation:
    interval: 30m
    window: 2h
  rules:
    - key: hourly-status
      check: gte
      field: count
      value: 1

No cron schedule — the pipeline fires when hourly-status.complete becomes true. The postRun block re-evaluates rules for 2 hours after job completion to detect data drift.

Run State Machine

Once a pipeline is triggered, Interlock tracks it through a state machine:

PENDING ──> TRIGGERING ──> RUNNING ──> COMPLETED
                │              │
                │              ├──> JOB_POLL_EXHAUSTED ──> FAILED_FINAL
                │              │
                └──> FAILED    └──> FAILED ──> FAILED_FINAL

State transitions use compare-and-swap (CAS) — every RunState carries a version number, and updates only succeed if the version matches. This prevents concurrent evaluations from clobbering each other. Terminal statuses COMPLETED and FAILED_FINAL are set by a dedicated CompleteTrigger step that also publishes lifecycle events.
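The CAS pattern can be sketched with an in-memory store — Store, RunState, and CASUpdate here are illustrative stand-ins for the DynamoDB conditional update, not Interlock's actual types:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// RunState carries a version number; every update must name the version it
// read, mirroring a DynamoDB ConditionExpression on the version field.
type RunState struct {
	Status  string
	Version int
}

var ErrVersionConflict = errors.New("version conflict: state changed concurrently")

// Store is an in-memory stand-in for the control table.
type Store struct {
	mu     sync.Mutex
	states map[string]RunState
}

func NewStore() *Store { return &Store{states: map[string]RunState{}} }

func (s *Store) Put(key string, st RunState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[key] = st
}

// CASUpdate succeeds only if the stored version matches expectedVersion,
// then bumps the version — a compare-and-swap, so a concurrent evaluation
// holding a stale version fails instead of clobbering the newer state.
func (s *Store) CASUpdate(key, newStatus string, expectedVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.states[key]
	if !ok || cur.Version != expectedVersion {
		return ErrVersionConflict
	}
	s.states[key] = RunState{Status: newStatus, Version: expectedVersion + 1}
	return nil
}

func main() {
	st := NewStore()
	st.Put("gold-revenue", RunState{Status: "PENDING", Version: 1})
	fmt.Println(st.CASUpdate("gold-revenue", "TRIGGERING", 1)) // succeeds
	fmt.Println(st.CASUpdate("gold-revenue", "RUNNING", 1))    // stale version: conflict
}
```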

Architecture

The architecture is fully event-driven on AWS. DynamoDB Streams notify downstream Lambdas when state changes, a Step Function orchestrates the evaluation lifecycle, and EventBridge routes all events to observability and alerting targets.

Local mode (Redis + Postgres + Docker Compose) was removed in v0.3.0 to focus on the event-driven AWS architecture. GCP and Azure backends are planned after AWS stabilizes.

Sensors ──write──> DynamoDB (control) ──stream──> stream-router
                                                        │
                              ┌─────────────────────────┤
                              │                         │
                     StartExecution              EventBridge
                              │                    events
                              v                         │
                   Step Function (orchestrator)    ┌────┴─────┐
                              │                    │          │
                              v                event-sink  alert-dispatcher
                        DynamoDB writes            │          │
                                              events table   Slack

DynamoDB — Multi-Table Design

Four tables with clear access patterns and independent scaling:

| Table | Purpose | Key Entities |
| --- | --- | --- |
| control | Pipeline state, sensor data, trigger status, locks | PIPELINE#, SENSOR#, TRIGGER#, LOCK# |
| joblog | Run history and audit trail | JOBLOG# entries with status, timestamps, failure categories |
| rerun | Re-run requests and budget tracking | RERUN_REQUEST#, RERUN_COUNT# |
| events | Centralized event log (14 event types) | PK/SK with GSI1 on eventType + timestamp |

The control table has DynamoDB Streams enabled. When a sensor write or trigger status change hits the stream, the stream-router evaluates whether to start a Step Function execution. Locks use conditional PutItem with attribute_not_exists(PK) OR #ttl < :now — same acquire-or-expire semantics across all lock types.

6 Lambda Handlers

| Lambda | Role |
| --- | --- |
| stream-router | DynamoDB Stream → Step Function (trigger conditions), routes events to EventBridge, validates rerun requests, detects late data arrival |
| orchestrator | Multi-mode dispatcher: evaluate rules, trigger jobs, check-job status, post-run validation, complete-trigger |
| sla-monitor | 5 modes: schedule, cancel, fire-alert, calculate, reconcile. Manages EventBridge Scheduler lifecycle for proactive SLA monitoring |
| watchdog | Stale trigger detection, missed cron schedules, sensor-triggered pipeline reconciliation, monitoring expiry |
| event-sink | Writes all 14 EventBridge event types to the events table for centralized observability |
| alert-dispatcher | SQS → Slack Bot API with message threading by pipeline-day. Thread records group related alerts into Slack threads |

All handlers use lazy dependency initialization via sync.Once and accept a Deps struct for testability.

Step Function Lifecycle

The ASL definition orchestrates the full lifecycle as a sequential state machine with clean exit paths at every phase:

  1. Gate checks: InitDefaults → CheckExclusion → AcquireLock → CheckRunLog
  2. Sensor evaluation: EvaluateRules (declarative rule checks against sensor data) → WaitForReadiness (retry loop with configurable interval)
  3. Trigger + bounded job polling: TriggerPipeline → PollJobStatus (bounded by jobPollWindowSeconds, default 1h) → HandleJobPollExhausted on timeout
  4. Complete trigger: CompleteTrigger sets terminal status (COMPLETED or FAILED_FINAL), publishes lifecycle events, cancels SLA schedules
  5. PostRun monitoring: Re-evaluate postRun rules at configured intervals, detect drift, handle late arrivals
  6. SLA checks: EventBridge Scheduler entries fire warnings and breaches at exact timestamps throughout the lifecycle

Every phase branches through Choice states with early exit — if the pipeline is excluded, the lock can’t be acquired, or the run already completed, the execution ends without wasting Lambda invocations. The Step Function has a global timeout (default 4h, configurable via sfn_timeout_seconds) to prevent unbounded execution.

Key Design Decisions

Distributed Locking and CAS

Lock keys follow the format eval:{pipeline}:{scheduleID}. Two schedules for the same pipeline evaluate concurrently; two evaluations for the same schedule are serialized. Lock TTL is calculated dynamically — ResolveTriggerLockTTL() reads the Step Function timeout and adds a 30-minute buffer. State transitions use compare-and-swap everywhere via DynamoDB ConditionExpression on a version field.

Multi-Schedule Support

Interlock supports arbitrary schedules at daily or hourly granularity. When sensors include both date and hour fields, the framework uses a composite execution date (e.g., 2026-03-03T10). Glue triggers automatically receive --par_day and --par_hour arguments.

Run logs are keyed by {pipeline}:{date}:{scheduleID}, so each window is tracked independently. Pipelines with cron schedules fire on their configured cadence; sensor-triggered pipelines fire as soon as the trigger condition is met.
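A small sketch of the key construction — RunLogKey and CompositeDate are hypothetical helper names, shown only to make the formats concrete:

```go
package main

import "fmt"

// RunLogKey builds the {pipeline}:{date}:{scheduleID} key described above,
// so each schedule window is tracked independently.
func RunLogKey(pipeline, date, scheduleID string) string {
	return fmt.Sprintf("%s:%s:%s", pipeline, date, scheduleID)
}

// CompositeDate joins date and hour sensor fields into the composite
// execution date format (e.g. 2026-03-03T10) used for sub-daily windows.
func CompositeDate(date string, hour int) string {
	return fmt.Sprintf("%sT%02d", date, hour)
}

func main() {
	d := CompositeDate("2026-03-03", 10)
	fmt.Println(RunLogKey("silver-cdr-hour", d, "hourly")) // silver-cdr-hour:2026-03-03T10:hourly
}
```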

Calendar Exclusions

Interlock supports named calendar files and inline exclusion rules (specific dates and weekdays). Exclusion is checked before acquiring a lock — excluded days are completely dormant with zero resource consumption.

8 Trigger Types

| Type | Backend | Status Check |
| --- | --- | --- |
| command | Shell subprocess | Exit code |
| http | HTTP webhook | Response status |
| airflow | Airflow REST API | DAG run state |
| glue | AWS Glue SDK | Job run status + CloudWatch RCA log cross-check |
| emr | EMR SDK | Step status |
| emr-serverless | EMR Serverless SDK | Job run status |
| step-function | Step Functions SDK | Execution status |
| databricks | REST API 2.1 | Job run status |

Each AWS trigger type defines a narrow SDK interface (2 methods) for testability. The glue type cross-checks the CloudWatch RCA log stream when Glue reports SUCCEEDED — Glue can return success when the Spark driver catches a SparkException and exits cleanly. The RCA log captures the actual failure reason.

Real-World Example: Medallion Pipeline

I built interlock-aws-example as a working reference deployment. It implements a medallion architecture (bronze → silver → gold) with real data sources:

  • USGS Earthquakes — seismic event data, ingested every 20 minutes
  • CoinLore Crypto — cryptocurrency market data, ingested every 20 minutes

This produces 4 pipelines (earthquake-silver, earthquake-gold, crypto-silver, crypto-gold) with hourly ETL schedules. Silver pipelines are sensor-triggered via hourly-status — no cron schedules. When the bronze ingestion writes sensor data, the silver pipeline evaluates immediately.

Cascade Triggering and PostRun Monitoring

Silver completion automatically triggers gold evaluation. When earthquake-silver completes for hour T10, Interlock writes a cascade marker that fires the earthquake-gold Step Function execution. No cron delay, no polling — the downstream pipeline evaluates as soon as its upstream dependency is met.

Silver pipelines use postRun rules to detect data drift after job completion. If new data arrives that changes the aggregation, the postRun evaluation detects the drift and triggers a cascade re-run with its own retry budget (maxDriftReruns).

Dashboard

The example includes a React + Vite dashboard with a timeline swimlane view. Each pipeline gets a row, each hour gets a cell, and clicking a cell drills down into the full event history for that execution. The dashboard reads directly from the DynamoDB events table.

Infrastructure

The example deploys to AWS with Terraform: 4 DynamoDB tables, 1 Step Function, 6 Go Lambda handlers, 4 Glue jobs (Delta Lake MERGE), EventBridge rules + schedulers, SQS alert queue with DLQ, and 1 S3 bucket. The entire stack is defined as a reusable Terraform module with for_each on the core Lambda/Glue/EventBridge resources.

Interlock is published as a Go module — the example repo imports it directly with go get github.com/dwsmith1983/interlock@v0.6.0.

What’s New in v0.6

| Version | Date | Highlights |
| --- | --- | --- |
| v0.6.0 | 2026-03-07 | Failure classification, per-source retry budgets, bounded job polling, Glue false-success detection |
| v0.5.0 | 2026-03-04 | Centralized observability (event-sink + alert-dispatcher), Slack Bot API threading, proactive SLA monitoring |
| v0.4.0 | 2026-03-03 | Declarative sensors/rules, multi-table DynamoDB, sub-daily execution, EventBridge SLA monitoring |
| v0.3.0 | 2026-02-27 | Post-completion monitoring expiry, watchdog improvements |

Declarative Sensors and Rules

The v0.4.0 release replaced the archetype/trait/evaluator system with a declarative sensor and rules model. External processes write sensor data to DynamoDB; pipeline configs define rules as YAML. Eight check types cover the common validation patterns:

  • Value checks: exists, equals, gt, gte, lt, lte
  • Temporal checks: age_lt (field updated within duration), age_gt (field older than duration)

This decouples data collection from validation. A sensor can be a Lambda, a cron job, or an inline step in your ETL — anything that can write to DynamoDB. The rules engine evaluates whatever fields exist without custom code.

Centralized Observability

The v0.5.0 release added an event-sink Lambda that writes all 14 EventBridge event types to a dedicated events table with a GSI for querying by event type and timestamp. The alert-dispatcher Lambda reads from an SQS queue and delivers formatted Slack notifications using the Bot API with message threading.

Thread records (THREAD#{scheduleId}#{date}) stored in the events table group related alerts into Slack threads by pipeline, schedule, and date. First alert for a pipeline-day creates a new thread; subsequent alerts reply in-thread. This keeps Slack channels usable even with dozens of hourly pipelines.
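The first-alert-creates-thread logic can be sketched as follows — ThreadKey matches the record format above, while postAlert and the in-memory map stand in for the events-table lookup and Slack API call:

```go
package main

import "fmt"

// ThreadKey builds the THREAD#{scheduleId}#{date} record key that groups
// alerts into one Slack thread per pipeline-day.
func ThreadKey(scheduleID, date string) string {
	return fmt.Sprintf("THREAD#%s#%s", scheduleID, date)
}

// postAlert decides how to deliver an alert: the first alert for a key
// records a new thread timestamp, and later alerts reply in that thread.
// threadTS is an in-memory stand-in for thread records in the events table.
func postAlert(threadTS map[string]string, key, newTS string) (ts string, isReply bool) {
	if existing, ok := threadTS[key]; ok {
		return existing, true // reply in the existing thread
	}
	threadTS[key] = newTS
	return newTS, false // first alert creates the thread
}

func main() {
	threads := map[string]string{}
	key := ThreadKey("gold-revenue:daily", "2026-03-07")
	ts, reply := postAlert(threads, key, "1709800000.000100")
	fmt.Println(ts, reply) // new thread
	ts, reply = postAlert(threads, key, "1709800100.000200")
	fmt.Println(ts, reply) // reply in the first thread
}
```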

14 event types: TRIGGER_STARTED, JOB_STARTED, JOB_COMPLETED, JOB_FAILED, JOB_POLL_EXHAUSTED, VALIDATION_EXHAUSTED, SLA_MET, SLA_WARNING, SLA_BREACH, SCHEDULE_MISSED, LATE_DATA_ARRIVAL, RERUN_REJECTED, RETRY_EXHAUSTED, TRIGGER_RECOVERED.

Failure Classification and Retry Budgets

The v0.6.0 release introduces failure classification. The StatusChecker interface propagates a FailureCategory (PERMANENT or TRANSIENT) through job status checks to the joblog.

This enables per-source retry budgets:

  • maxRetries — retries for transient infrastructure failures (e.g., ConcurrentRunsExceededException)
  • maxCodeRetries — retries for permanent code failures (default 1, so code bugs don’t burn the full retry budget)
  • maxDriftReruns — budget for drift-triggered re-runs (postRun monitoring)
  • maxManualReruns — budget for external rerun requests

Each budget is tracked independently via CountRerunsBySource, so drift reruns don’t consume the job-failure retry budget. All fields use *int pointer semantics — nil uses the default, 0 disables.
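The pointer convention can be shown in a few lines — resolveBudget is a hypothetical helper illustrating the nil-default / zero-disable semantics:

```go
package main

import "fmt"

// resolveBudget implements the *int convention described above:
// nil falls back to the default, an explicit 0 disables retries.
func resolveBudget(configured *int, def int) int {
	if configured == nil {
		return def
	}
	return *configured
}

func main() {
	var unset *int
	zero, two := 0, 2
	fmt.Println(resolveBudget(unset, 2)) // 2 — nil uses the default
	fmt.Println(resolveBudget(&zero, 2)) // 0 — explicitly disabled
	fmt.Println(resolveBudget(&two, 1))  // 2 — explicit override
}
```

The pointer distinguishes "field omitted from YAML" from "field set to zero" — a plain int can't tell the two apart after unmarshaling.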

Bounded Job Polling

The jobPollWindowSeconds field (default: 3600s / 1h) caps how long the Step Function polls for job completion. When the window expires, the orchestrator writes a timeout joblog entry, sets the trigger to FAILED_FINAL, and publishes a JOB_POLL_EXHAUSTED event. This prevents unbounded check-job polling when external jobs hang indefinitely.

PostRun Monitoring

The postRun block enables drift detection after job completion. Rules are re-evaluated at a configured interval within a time window. If post-run rules fail (e.g., row count drops, freshness degrades), the system can trigger cascade re-runs for downstream pipelines.

This catches a class of problem that pre-run validation can’t: data that was valid when the job ran but changed afterward. Late-arriving records, upstream corrections, or partition overwrites all produce drift that postRun monitoring detects.

Safety Hardening

Several safety features were added across v0.5–v0.6:

  • Glue false-success detection: cross-checks CloudWatch RCA log stream when Glue reports SUCCEEDED. Catches cases where the Spark driver exits cleanly despite a SparkException.
  • Step Function global timeout: configurable via sfn_timeout_seconds Terraform variable (default 4h). Prevents unbounded execution if the orchestrator loop stalls.
  • Pipeline config validation: ValidatePipelineConfig enforces bounds on all retry/rerun fields at config load time. Invalid configs are logged and skipped.
  • SLA alert suppression: handleSLAFireAlert checks trigger status before publishing — suppresses warnings and breaches for pipelines that already completed or permanently failed.
  • Watchdog forward-only alerting: skips cron schedules whose most recent expected fire time is before the Lambda’s cold start, preventing retroactive alerts after deploys.

What’s Next

Cloud providers. GCP (Firestore + Cloud Run + Workflows) and Azure (Cosmos DB + Functions + Durable Functions) are planned after the AWS implementation stabilizes. The core framework is backend-agnostic; the declarative rules engine and state machine are pure Go with no AWS dependencies.

Dashboard improvements. The React + Vite dashboard currently reads directly from DynamoDB. Planned additions include historical trend views, SLA compliance dashboards, and alerting configuration.