Updated January 5, 2026: This post has been updated for spark-bestfit v2.0.0, which introduces a class-based API, multi-backend support (Spark, Ray, Local), and many new features. See Migration from v0.x if upgrading from an earlier version.
I had a dataset with 100 million rows and needed to figure out which statistical distribution fit it best. The obvious approach—collect everything to the driver and use scipy—immediately OOM’d. Even after sampling down to something reasonable, I was still looking at 800MB sitting in driver memory just to test one distribution. And I needed to test about 100 of them.
The math doesn’t work. If you’re fitting 100 distributions and each one needs the full dataset, you either serialize 800MB to executors 100 times or collect it all to the driver and pray. Neither option scales.
What I Built
spark-bestfit is now at version 2.0.0 with a complete redesign. The library fits ~90 scipy.stats distributions in parallel, with driver memory usage dropping from 800MB to about 1MB—a 1000x reduction.
The v2.0 release adds:
- Multi-backend architecture: Spark, Ray, or local execution
- Class-based API: `DistributionFitter` and `DiscreteDistributionFitter`
- 16 discrete distributions for count data
- Bounded fitting, lazy metrics, multi-column fitting, Gaussian copula, bootstrap confidence intervals, distributed sampling, and model serialization
The histogram approach remains the core innovation, but now you have far more flexibility in how and where you run the fitting.
The Problem With Naive Approaches
There are two ways people usually try this, and both fail:
Collect to driver:
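In code, the naive pattern looks something like this (a sketch, assuming a PySpark DataFrame `df` with a numeric `value` column):

```python
import scipy.stats as stats

# Pull the entire column into driver memory, then fit locally with scipy.
values = df.select("value").toPandas()["value"].to_numpy()

# One driver-side fit per candidate distribution, repeated ~100 times.
params = stats.norm.fit(values)
```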
Works fine until your dataset is large enough to matter. Then you OOM.
Broadcast to executors:
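A sketch of the broadcast variant (same assumed `df`, plus an active `spark` session):

```python
import scipy.stats as stats

# Still collects everything once, then ships the raw array to every executor.
values = df.select("value").toPandas()["value"].to_numpy()
bcast = spark.sparkContext.broadcast(values)  # ~800MB copied to each executor

def fit_one(dist_name):
    dist = getattr(stats, dist_name)
    return dist_name, dist.fit(bcast.value)

# Parallelize over distribution names; every task reads the big broadcast.
fits = spark.sparkContext.parallelize(["norm", "gamma", "lognorm"]).map(fit_one).collect()
```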
This works but you’re broadcasting 800MB to every executor. Slow and wasteful.
Histograms Instead of Raw Data
Here’s the trick: you don’t actually need raw data to fit distributions. You need a histogram.
A 100-bin histogram is about 1KB. Broadcast that instead of 800MB and you can fit distributions in parallel without killing your cluster.
The flow:
- Compute histogram using Spark’s distributed aggregation (no data leaves executors)
- Broadcast the tiny histogram (~1KB) once
- Fit distributions in parallel using Pandas UDFs
- Get back a DataFrame with goodness-of-fit metrics
Driver memory goes from 800MB to 1MB. Problem solved.
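To make the trick concrete, here's a minimal, library-free demonstration that a histogram carries enough information to recover distribution parameters. It uses method-of-moments on bin centers for simplicity; the library itself runs scipy's fitters, but the point is the same:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=100_000)  # stand-in for a distributed column

# Summarize to 100 bins: this tiny summary is all the driver needs to keep.
counts, edges = np.histogram(data, bins=100)
centers = (edges[:-1] + edges[1:]) / 2

# Estimate parameters from the histogram alone (method of moments).
loc = np.average(centers, weights=counts)
scale = np.sqrt(np.average((centers - loc) ** 2, weights=counts))
# loc comes back near 10 and scale near 2, without touching the raw rows again
```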
The v2.0 Class-Based API
Version 2.0 replaces the functional fit_distributions() with a class-based design:
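Usage looks roughly like this, based on the class and method names described in this post (exact signatures may differ; see the library docs):

```python
from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark)
results = fitter.fit(df, column="value")

best = results.best()  # top candidate by goodness-of-fit
print(best)
```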
For count data, use the discrete fitter:
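A parallel sketch for count data, with the same caveat on exact signatures:

```python
from spark_bestfit import DiscreteDistributionFitter

fitter = DiscreteDistributionFitter(spark)
results = fitter.fit(df, column="event_count")
print(results.best())
```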
The class-based approach enables lazy metric computation, chained operations, and better state management for complex workflows.
Multi-Backend Architecture
The library’s origin is Spark, but v2.0 adds pluggable backends:
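Backend selection might look like this; the `backend` keyword and the import path are my assumptions based on the backend names in the table below:

```python
from spark_bestfit import DistributionFitter
from spark_bestfit.backends import LocalBackend  # assumed import path

# Same fitting API, no Spark cluster required.
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(pandas_df, column="value")
```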
All backends use identical scipy fitting, so fit quality is the same regardless of backend. The choice comes down to your infrastructure:
| Backend | Use Case |
|---|---|
| SparkBackend | Production clusters, 100M+ rows |
| RayBackend | ML workflows, Kubernetes deployments |
| LocalBackend | Development, unit testing |
How It Works
The histogram computation uses Spark’s Bucketizer:
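The general shape of that computation, assuming a DataFrame `df` with a numeric `value` column (the library's internals may differ in detail):

```python
from pyspark.ml.feature import Bucketizer
import pyspark.sql.functions as F

# One distributed pass to find the range, a second to bin every row.
lo, hi = df.agg(F.min("value"), F.max("value")).first()
edges = [lo + i * (hi - lo) / 100 for i in range(101)]

bucketed = Bucketizer(splits=edges, inputCol="value", outputCol="bin").transform(df)
hist = bucketed.groupBy("bin").count().orderBy("bin").collect()
# Only ~100 (bin, count) rows ever reach the driver.
```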
That runs as a distributed aggregation. The only thing that comes back to the driver is the histogram itself—a tiny array of bin counts.
The backend abstraction means the same fitting logic runs on Spark, Ray, or locally. The ExecutionBackend protocol defines methods for broadcasting data to workers and for executing fits in parallel.
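As a rough illustration of what such a protocol could look like (method names here are illustrative, not the library's actual signatures), with a toy in-process backend:

```python
from typing import Any, Callable, Protocol, Sequence

class ExecutionBackend(Protocol):
    """Sketch of the backend contract: ship a small payload, map in parallel."""

    def broadcast(self, data: Any) -> Any: ...
    def map_parallel(self, fn: Callable[[Any], Any], items: Sequence[Any]) -> list: ...

class ToyLocalBackend:
    """In-process stand-in: broadcast is a no-op, mapping is a plain loop."""

    def broadcast(self, data: Any) -> Any:
        return data

    def map_parallel(self, fn: Callable[[Any], Any], items: Sequence[Any]) -> list:
        return [fn(item) for item in items]

backend: ExecutionBackend = ToyLocalBackend()
hist = backend.broadcast({"counts": [3, 5, 2]})         # tiny payload, shipped once
lengths = backend.map_parallel(len, ["norm", "gamma"])  # parallel over names, not data
```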
New Features in v2.0
Bounded Distribution Fitting
Fit distributions with natural constraints (percentages, ages, prices):
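For example (the `bounds` parameter name is an assumption; check the docs for the exact spelling):

```python
# Percentages live in [0, 100]; exclude distributions that can't respect that.
results = fitter.fit(df, column="completion_pct", bounds=(0, 100))
```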
Multi-Column Fitting
Fit multiple columns in a single operation:
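Something like the following, with argument names assumed:

```python
# One pass over the data, several columns fitted at once.
results = fitter.fit(df, columns=["latency_ms", "payload_kb", "retries"])
best_latency = results.best(column="latency_ms")  # per-column best fit
```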
Lazy Metrics
Defer expensive K-S and A-D computation for faster model selection:
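Roughly, using the `lazy_metrics` flag mentioned later in this post:

```python
# Cheap ranking first; K-S and A-D are only computed if actually accessed.
results = fitter.fit(df, column="value", lazy_metrics=True)
best = results.best()
```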
Pre-filtering
Skip incompatible distributions based on data shape (20-50% faster):
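Roughly, using the `prefilter` flag mentioned later in this post:

```python
# E.g. strictly positive data can skip distributions with negative support.
results = fitter.fit(df, column="value", prefilter=True)
```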
Bootstrap Confidence Intervals
Quantify parameter uncertainty:
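A hypothetical sketch; the method and argument names here are guesses, not the library's documented API:

```python
best = fitter.fit(df, column="value").best()
ci = best.confidence_intervals(n_bootstrap=500, alpha=0.05)  # assumed method name
```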
Gaussian Copula
Generate correlated multi-column samples:
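A hypothetical sketch, with the class and method names assumed from the feature description:

```python
from spark_bestfit import GaussianCopula  # assumed name

copula = GaussianCopula(results)      # results from a multi-column fit
samples = copula.sample(n=1_000_000)  # preserves marginals and correlations
```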
Model Serialization
Save and load fitted distributions without Spark:
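A sketch built from the `.save()` method named in the migration notes; the `load()` counterpart is my assumption:

```python
best = results.best()
best.save("best_fit.json")

# Later, outside Spark:
from spark_bestfit import DistributionFitResult
model = DistributionFitResult.load("best_fit.json")  # assumed loader
draws = model.sample(10_000)
```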
Visualization
Built-in plotting for distribution comparison:
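Sketched from the `fitter.plot()` entry point named in the migration notes (keyword arguments assumed):

```python
fig = fitter.plot(results, top_n=5)  # overlay top candidates on the histogram
fig.savefig("fit_comparison.png")
```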
Performance Impact
The difference is dramatic. On large datasets (100M+ rows):
Traditional approach: You’re either collecting hundreds of megabytes to the driver (OOM risk) or broadcasting that same data to executors repeatedly (slow).
Histogram approach: Driver holds ~1KB of histogram data. Broadcast happens once. The memory footprint drops by roughly 1000x compared to collecting raw data.
The v2.0 features add more optimization opportunities:
- `lazy_metrics=True` skips expensive K-S/A-D computation until needed
- `prefilter=True` eliminates incompatible distributions early
- Distribution-aware partitioning spreads slow distributions across executors
When To Use This
If you have millions of rows and need to test many distributions, this makes sense. If your data fits in memory on a single machine, you can still use spark-bestfit with the LocalBackend—no Spark required.
The library is on PyPI and GitHub. Full documentation at spark-bestfit.readthedocs.io.
Migration from v0.x
If you’re upgrading from v0.3.x, here’s the API migration:
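A before/after sketch (the v0.x call shape is inferred from the function name; the v2.0 shape follows the changes listed below):

```python
# v0.x: functional API
from spark_bestfit import fit_distributions
results_df = fit_distributions(spark, df, column="value")  # argument order assumed

# v2.0: class-based API
from spark_bestfit import DistributionFitter
results = DistributionFitter(spark).fit(df, column="value")
best = results.best()
```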
Key changes:
- `fit_distributions()` → `DistributionFitter(spark).fit()`
- Results are now a `FitResults` container with `.best()`, `.filter()`, `.quality_report()` methods
- Each result is a `DistributionFitResult` with `.pdf()`, `.cdf()`, `.sample()`, `.save()` methods
- Q-Q plots moved from `spark_bestfit.plotting.qq_plot()` to `fitter.plot()`
See the Migration Guide for complete details.
The Core Idea
Move computation to data, not data to computation. Histograms are a compact summary that enable distributed fitting without shuffling gigabytes around your cluster.
The v2.0 architecture takes this further by abstracting the execution backend. Whether you’re on a Spark cluster, Ray, or your laptop, the same histogram-based approach minimizes data movement.
Nothing revolutionary here—just using distributed primitives correctly.