Building Production-Ready Spark Pipelines with Configuration-Driven Architecture

v1.4.0 (January 17, 2026): Spark Connect support, full Structured Streaming, schema contracts, data quality hooks, secrets management. API stable since 1.0.0 (December 28, 2025).

After building data pipelines across four companies over the past several years, I kept running into the same frustrations. SparkSession setup code was copied and pasted between projects. Component configurations were scattered across the codebase. And getting proper observability into what was actually happening in production was always an afterthought tacked on at the end.

What I Built

spark-pipeline-framework is a Scala library that lets you define Spark pipelines using HOCON configuration files instead of hardcoding everything. The framework handles the boring parts—SparkSession management, component wiring, error handling—so you can focus on the actual data transformations.

The core idea is simple: write your pipeline components once, then configure them differently for dev, staging, and production environments without touching code. But the implementation turned out to be more interesting than I expected, especially around using reflection to keep the framework decoupled from application code.

Why Configuration-Driven?

Managing pipeline configurations is messier than it should be. You can use spark-submit’s –conf flags for Spark settings, but what about your business logic? Filter thresholds, output paths, API endpoints—those end up hardcoded or spread across property files, environment variables, and command-line arguments.

With HOCON, everything lives in one place:

pipeline {
  components = [
    {
      name = "Transform"
      class = "com.example.TransformComponent"
      config {
        filterThreshold = 100
        outputPath = "s3://prod-bucket/data"
      }
    }
  ]
}

spark {
  appName = "My ETL Pipeline"
  config {
    "spark.sql.shuffle.partitions" = "200"
  }
}

PureConfig handles type-safe parsing, so you get compile-time guarantees that your configs are valid. No more discovering at runtime that you typo’d a config key or passed a string where an int was expected.

The Reflection Problem

Here’s where it gets interesting. I wanted the framework to instantiate components dynamically based on config files, but I didn’t want the framework to depend on application code at compile time. That creates tight coupling and makes versioning a nightmare.

The solution was to use Scala’s reflection API to load component classes by name at runtime:

components = [
  {
    name = "Transform"
    class = "com.example.TransformComponent"
    config {
      filterThreshold = 100
    }
  }
]

The framework finds the class, looks for its companion object, and calls the apply method with the parsed config. If something goes wrong—missing class, wrong constructor signature, whatever—you get a clear error message instead of a cryptic reflection exception.

This approach means you can build new pipeline components without ever touching the framework code. True plugin architecture.

Observability That Actually Works

The most valuable part of the framework, in my opinion, is the structured logging. Every pipeline run gets a correlation ID that flows through all components. Events are logged as JSON with standardized field names:

1
2
3
4
5
6
7
{
  "event": "component_end",
  "run_id": "abc-123",
  "component_name": "Transform",
  "duration_ms": 1523,
  "status": "success"
}

This integrates directly with tools like Datadog or Splunk. You can track pipeline performance over time, correlate errors across distributed systems, and actually understand what’s happening in production. No more grepping through gigabytes of unstructured logs trying to figure out which component failed.

The logging system uses a hooks pattern where you can compose multiple hooks together. One hook handles JSON logging, another might collect metrics, another might notify Slack on failures. Clean separation of concerns.

Module Structure

The framework is split into three modules because I got tired of dragging Spark dependencies into every component:

spark-pipeline-core has zero Spark dependencies. It’s just the configuration models, hooks trait, and reflection logic. You can use this in tests without spinning up a SparkSession.

spark-pipeline-runtime manages SparkSession lifecycle and provides the DataFlow trait that components implement.

spark-pipeline-runner ties everything together and provides the CLI entry point.

This separation makes testing way easier and keeps dependency graphs clean.

A Real Component

Here’s what a component looks like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
package com.example

import io.github.dwsmith1983.spark.pipeline.runtime.DataFlow
import org.apache.spark.sql.{DataFrame, SparkSession}

case class TransformConfig(filterThreshold: Int)

class TransformComponent(config: TransformConfig) extends DataFlow {
  override def transform(input: DataFrame)(implicit spark: SparkSession): DataFrame = {
    input.filter(s"value > ${config.filterThreshold}")
  }
}

The framework handles instantiation, passes in the parsed config, and calls transform at the right time. Your code focuses on the actual transformation logic.

Cross-Compilation Hell

Supporting Spark 3.5, 4.0, Scala 2.12, and 2.13 simultaneously is not fun. The build file turned into a maze of conditionals and version-specific dependencies. But it works, and now the library can be used across different Spark versions without forking the codebase.

GitHub Actions runs the full test matrix on every commit to make sure nothing breaks. Pre-commit hooks enforce code formatting with scalafmt. Release-please handles semantic versioning automatically.

Getting Started

The project is published to Maven Central, so you can add it to build.sbt:

1
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-runner" % "1.4.0"

Create a configuration file and implement your components as DataFlow instances. The framework handles the rest.

Post-1.0 Features

Version	Feature	Description
1.4.0	Spark Connect	Remote execution, thin client deployments
1.3.0	Structured Streaming	Production-ready with exactly-once semantics
1.2.0	Schema Contracts	DataFrame validation at component boundaries
1.1.0	Data Quality Hooks	Pluggable checks between pipeline stages
1.0.0	API Stability	Semantic versioning guarantees

Spark Connect (v1.4.0)

Remote execution without bundling Spark dependencies:

1
2
3
4
5
6
val spark = SparkConnectSession.builder()
  .remote("sc://spark-cluster:15002")
  .getOrCreate()

val runner = PipelineRunner(spark)
runner.run(pipelineConfig)

Enables cloud-native deployments, serverless functions, and Kubernetes environments with thin client applications.

Structured Streaming (v1.3.0)

pipeline {
  streaming = true
  trigger = "ProcessingTime(10 seconds)"
  checkpointLocation = "s3://bucket/checkpoints"

  components = [
    {
      name = "KafkaSource"
      class = "com.example.KafkaSourceComponent"
      config { brokers = "kafka:9092", topic = "events" }
    }
  ]
}

Handles checkpoint management, exactly-once semantics, graceful shutdown.

Schema Contracts

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
class TransformComponent(config: TransformConfig) extends DataFlow with SchemaContract {
  override def inputSchema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("value", DoubleType, nullable = true)
  ))

  override def outputSchema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("processed_value", DoubleType, nullable = false)
  ))

  override def transform(input: DataFrame)(implicit spark: SparkSession): DataFrame =
    input.withColumn("processed_value", col("value") * 2)
}

Data Quality Hooks

1
2
3
4
5
val qualityHook = DataQualityHook(
  checks = Seq(NullCheck("id"), RangeCheck("value", 0, 1000), UniquenessCheck("id")),
  onFailure = FailPipeline  // or LogAndContinue, QuarantineRows
)
val runner = PipelineRunner(spark, hooks = Seq(qualityHook))

Checkpointing and Circuit Breakers

pipeline {
  checkpointing { enabled = true, location = "s3://bucket/checkpoints", interval = "5 minutes" }
  circuitBreaker { failureThreshold = 3, resetTimeout = "1 minute" }
}

Secrets Management

Integration with HashiCorp Vault, AWS Secrets Manager, environment variables:

secrets { provider = "vault", vault { address = "https://vault.example.com", path = "secret/data/pipeline" } }

pipeline {
  components = [{ name = "DatabaseSink", config { password = "${secrets:db_password}" } }]
}

Code on GitHub.