Distribution Fitting
Distribution fitting is the process of selecting and parameterizing a probability distribution that best describes observed data. Analysts accomplish this by comparing empirical data against theoretical distributions, adjusting parameters to minimize discrepancies. Methods often include statistical tests (e.g., Kolmogorov-Smirnov, Anderson-Darling), visual assessments (e.g., histograms, Q-Q plots), and numerical criteria (e.g., AIC, BIC, or log-likelihood).
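As a minimal sketch of this comparison step (using Distributions.jl directly, independent of the MCHammer functions documented below), candidate distributions can be fit by maximum likelihood and ranked by AIC:

```julia
using Distributions, Random

Random.seed!(1)
data = rand(LogNormal(0, 1), 1000)   # synthetic "observed" data

# Fit each candidate by maximum likelihood and score it with AIC = 2k - 2*logL
for D in (Normal, LogNormal, Exponential)
    d = fit_mle(D, data)
    k = length(params(d))            # number of fitted parameters
    aic = 2k - 2loglikelihood(d, data)
    println(rpad(string(D), 14), "AIC = ", round(aic, digits = 1))
end
```

Lower AIC indicates a better trade-off between fit quality and parameter count; here the LogNormal candidate should score best, since it generated the data.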
Fitting Correctly
Practical Applications for Analysts
Distribution fitting is foundational for analysts across various fields:
- Risk Management & Finance: Analysts fit distributions to asset returns or operational loss data to estimate Value at Risk (VaR), conduct stress testing, and manage financial risks.
- Reliability Engineering: By fitting failure-time data (such as machine lifetimes) to distributions like Weibull or Lognormal, analysts predict product lifespan and develop appropriate maintenance schedules.
- Quality Control: Analysts use fitted distributions to monitor manufacturing processes, identifying deviations from expected behaviors to maintain quality standards.
- Environmental Modeling: Distribution fitting helps predict extreme weather events, flood frequency, and environmental risks by modeling historical weather data or natural phenomena.
- Decision Analysis & Simulation: Analysts model uncertainty in decision-making contexts by fitting distributions to historical data, enabling realistic simulations (e.g., Monte Carlo simulations) for strategic planning.
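A common pattern in the simulation setting is to fit historical data once and then drive a Monte Carlo simulation from the fitted distribution. A sketch with Distributions.jl (the Gamma demand model and sample sizes are purely illustrative):

```julia
using Distributions, Random, Statistics

Random.seed!(7)
historical = rand(Gamma(2, 3), 500)   # stand-in for observed demand history

# Fit once, then simulate many future scenarios from the fitted model
d = fit_mle(Gamma, historical)
trials = rand(d, 10_000)

# A planning quantile from the simulation, e.g. the 95th percentile of demand
println("Simulated P95: ", round(quantile(trials, 0.95), digits = 2))
```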
Caveats of Fitting Distributions Empirically
Despite its usefulness, empirical distribution fitting has important limitations:
- Data Quantity and Quality: Small or noisy datasets may lead to unstable parameter estimates, resulting in unreliable conclusions.
- Outliers and Extreme Values: Extreme observations can significantly influence the fitted distribution parameters, potentially distorting insights unless handled appropriately.
- Overfitting Risk: Analysts might select overly complex distributions that closely fit historical data but generalize poorly to new observations, limiting predictive power.
- Misinterpretation of Statistical Tests: Statistical tests like Kolmogorov-Smirnov or Anderson-Darling can indicate a good fit even when practical considerations (such as data context) suggest otherwise, or vice versa.
- Subjective Judgment: Empirical fitting often requires subjective judgment when choosing between multiple similarly good fits, potentially introducing bias or inconsistency.
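The data-quantity caveat is easy to demonstrate: refitting the same model on repeated small samples yields widely varying parameter estimates. A sketch with Distributions.jl (the n = 20 sample size and replication count are arbitrary):

```julia
using Distributions, Random

Random.seed!(3)
# Refit a Normal on 1000 independent samples of only n = 20 points each
sigmas = [params(fit_mle(Normal, rand(Normal(0, 1), 20)))[2] for _ in 1:1000]

# The estimated sigma swings widely even though the true value is always 1.0
println("sigma-hat spread at n = 20: ", round(minimum(sigmas), digits = 2),
        " to ", round(maximum(sigmas), digits = 2))
```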
Importance of Selecting Appropriate Distributions Beforehand
Understanding appropriate theoretical distributions prior to analysis is crucial:
- Domain Knowledge: Analysts familiar with the underlying process or theoretical context of the data (e.g., finance, engineering, biology) can immediately focus on distributions that are known to realistically model the phenomena.
- Statistical Properties: Different distributions have distinct properties such as skewness, kurtosis, tail heaviness, and bounds. Analysts must select distributions that inherently align with these features of the data.
- Avoiding Mis-Specification: Selecting inappropriate distributions can lead to misinterpretation of risk, poor resource allocation, or misguided policy decisions. Knowledgeable selection beforehand reduces the likelihood of such errors.
- Interpreting Results: Using contextually appropriate distributions enhances the interpretability of model parameters, helping stakeholders more clearly understand risks, probabilities, and implications of the analysis.
When used carefully, distribution fitting is a valuable analytical tool enabling informed decision-making under uncertainty. However, analysts must carefully consider data quality, model complexity, and theoretical context to ensure robust, meaningful insights from their fitted distributions.
Functions for analyzing and fitting distributions
- Visualizing Fits: viz_fit
- Descriptive Fit Statistics: fit_stats
- Automatic Distribution Fitting: autofit_dist
Visually analyzing fits
MCHammer.viz_fit — Function

viz_fit(SampleData; DistFit=[], cumulative=false)

Visualizes sample data against fitted probability density functions (PDFs) and cumulative distribution functions (CDFs).

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Array of distribution types to fit. Defaults to [Normal, LogNormal, Uniform] if not provided.
- cumulative (optional): If true, returns the results in cumulative (CDF) form. Default is probability density (PDF) form.

Returns
A plot object with the density of SampleData overlaid by the fitted PDFs / CDFs.
using MCHammer, Distributions, Random

# Generate sample data from a Normal distribution
Random.seed!(42)
sample_data = randn(1000)
fit_result = viz_fit(sample_data)
Fit vs Data Stats
MCHammer.fit_stats — Function

fit_stats(SampleData; DistFit=[], pvals=true, Increment=0.1)

Calculates descriptive statistics for the sample data and for each fitted distribution in DistFit.

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Pre-selected array of distribution types to fit. Defaults to [Normal, LogNormal, Uniform] if not provided.
- pvals (optional): Displays percentiles for the sample data and the fits.
- Increment (optional): Increment for the percentiles (e.g., 0.1 for 0%, 10%, …, 100%). Default is 0.1.

Returns
A DataFrame (transposed) containing descriptive statistics (mean, median, mode, standard deviation, variance, skewness, kurtosis, etc.) for the sample data and for each fitted distribution.

When pvals = true, percentile rows are added to the stats table:
- Percentile: The percentile (in %).
- SampleData: The corresponding quantile of the sample data.
- One column per fitted distribution containing the theoretical quantiles.
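The percentile rows can be reproduced directly: the SampleData column is the empirical quantile of the data, and each distribution column is the theoretical quantile of the fitted model. A sketch with Distributions.jl (mirroring the default 0.1 increment, interior percentiles only):

```julia
using Distributions, Random, Statistics

Random.seed!(42)
data = rand(LogNormal(0, 1), 1000)
d = fit_mle(LogNormal, data)

# Empirical quantile of the data vs. theoretical quantile of the fit
for p in 0.1:0.1:0.9
    println(round(Int, 100p), "%  sample = ", round(quantile(data, p), digits = 3),
            "   fitted = ", round(quantile(d, p), digits = 3))
end
```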
using MCHammer, Distributions, Random

# Generate sample data from a LogNormal distribution
Random.seed!(42)
sample_data = rand(LogNormal(0, 1), 1000)
fits = fit_stats(sample_data)
show(fits, allrows=true, allcols=true)
22×5 DataFrame
Row │ Name Sample Data Distributions.Normal Distributions.LogNormal Distributions.Uniform
│ Any Any Any Any Any
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Mean 1.52063 1.52063 1.53291 11.0431
2 │ Median 0.956697 1.52063 0.941657 11.0431
3 │ Mode 0.695338 1.52063 0.355341 11.0431
4 │ Standard_Deviation 1.85148 1.85055 1.96907 6.34239
5 │ Variance 3.42797 3.42454 3.87723 40.2259
6 │ Skewness 4.24973 0.0 5.97307 0.0
7 │ Kurtosis 30.1931 0.0 101.604 -1.2
8 │ Coeff_Variation 1.21757 1.21696 1.28453 0.574329
9 │ Minimum 0.0577953 -Inf 0.0 0.0577953
10 │ Maximum 22.0285 Inf Inf 22.0285
11 │ MeanStdError 0.0585489 NaN NaN NaN
12 │ 0.0 0.0577953 -Inf 0.0 0.0577953
13 │ 10.0 0.255968 -0.850948 0.265733 2.25486
14 │ 20.0 0.407048 -0.0368346 0.410261 4.45193
15 │ 30.0 0.569005 0.550199 0.56113 6.649
16 │ 40.0 0.755634 1.0518 0.733287 8.84606
17 │ 50.0 0.956697 1.52063 0.941657 11.0431
18 │ 60.0 1.20084 1.98946 1.20924 13.2402
19 │ 70.0 1.62215 2.49106 1.58024 15.4373
20 │ 80.0 2.22462 3.07809 2.16135 17.6343
21 │ 90.0 3.37068 3.89221 3.33687 19.8314
22 │ 100.0 22.0285 Inf Inf 22.0285
Automatic Fitting
MCHammer.autofit_dist — Function

autofit_dist(SampleData; DistFit=nothing, FitLib=nothing, sort="AIC", verbose=false)

Fits a list of candidate distributions to the provided sample data, computes goodness-of-fit statistics, and returns a DataFrame summarizing the results.

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Array of distribution types to attempt (e.g., [Normal, Gamma, Exponential]). Each element must be a type that is a subtype of Distribution. If provided, this overrides FitLib.
- FitLib (optional): Symbol indicating a predefined library of distributions to use. Valid options are :all, :continuous, or :discrete. Defaults to :continuous if neither DistFit nor FitLib is provided.
- sort (optional): String indicating the criterion to sort the results. Options include "ad", "ks", "ll", or "AIC". Defaults to "AIC".
- verbose (optional): Boolean flag indicating whether to print warnings for distributions that fail to fit. Defaults to false.

Returns
A DataFrame with the following columns:
- DistName: The name of the distribution type.
- ADTest: Anderson-Darling test statistic.
- KSTest: Kolmogorov-Smirnov test statistic.
- AIC: Akaike Information Criterion.
- AICc: Corrected AIC.
- LogLikelihood: Log-likelihood of the fit.
- FitParams: The parameters of the fitted distribution.
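The AIC and AICc columns follow the standard definitions. A sketch of the relationship using Distributions.jl (this illustrates the formulas, not autofit_dist's internals):

```julia
using Distributions, Random

# AICc adds a small-sample penalty to AIC: AICc = AIC + 2k(k+1)/(n-k-1)
function aic_aicc(d::Distribution, data)
    k = length(params(d))                 # number of fitted parameters
    n = length(data)
    aic = 2k - 2loglikelihood(d, data)
    return aic, aic + 2k * (k + 1) / (n - k - 1)
end

Random.seed!(42)
data = rand(LogNormal(0, 1), 1000)
aic, aicc = aic_aicc(fit_mle(LogNormal, data), data)
println("AIC = ", round(aic, digits = 2), ", AICc = ", round(aicc, digits = 2))
```

With n = 1000 observations and only two parameters, the correction is tiny; it matters most for small samples or many-parameter candidates.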
using MCHammer, Distributions, Random

Random.seed!(42)
sample_data = rand(LogNormal(0, 1), 1000)
fits = autofit_dist(sample_data)
show(fits, allrows=true, allcols=true)
[ Info: No `DistFit` or `FitLib` provided, defaulting to FitLib = :all.
13×7 DataFrame
Row │ DistName ADTest KSTest AIC AICc LogLikelihood FitParams
│ String Float64 Float64 Float64 Float64 Float64 Any
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Distributions.LogNormal 0.214931 0.0158499 2694.89 2694.9 -1345.44 (-0.0601137, 0.9872)
2 │ Distributions.InverseGaussian 3.71398 0.0561168 2724.91 2724.92 -1360.45 (1.52063, 0.938592)
3 │ Distributions.Gamma 9.73695 0.0766574 2825.48 2825.49 -1410.74 (1.18215, 1.28633)
4 │ Distributions.Exponential 10.8273 0.0653088 2840.25 2840.25 -1419.12 (1.52063,)
5 │ Distributions.Weibull 10.324 0.0664369 2840.86 2840.87 -1418.43 (1.02719, 1.54015)
6 │ Distributions.Cauchy 39.889 0.211156 3352.57 3352.59 -1674.29 (0.956697, 0.702138)
7 │ Distributions.Laplace 48.0785 0.207575 3431.8 3431.81 -1713.9 (0.956697, 1.02097)
8 │ Distributions.Pareto Inf 0.323936 3936.38 3936.39 -1966.19 (0.358329, 0.0577953)
9 │ Distributions.Normal 86.8675 0.214622 4072.84 4072.86 -2034.42 (1.52063, 1.85055)
10 │ Distributions.Rayleigh 418.213 0.380772 4229.75 4229.76 -2113.88 (1.69364,)
11 │ Distributions.Uniform Inf 0.753007 6183.42 6183.43 -3089.71 (0.0577953, 22.0285)
12 │ Distributions.Categorical{P} whe… Inf 0.001 13819.5 13819.5 -6907.76 ([0.0577953, 0.0640543, 0.064986…
13 │ Distributions.DiscreteNonParamet… Inf 0.001 13819.5 13819.5 -6907.76 ([0.0577953, 0.0640543, 0.064986…
Sources & References
- Eric Torkia, Decision Superhero Vol. 2, Chapter 3: "Superpower: Modeling the Behaviors of Inputs", Technics Publications, 2025.
- Available on Amazon: https://a.co/d/4YlJFzY. Volumes 2 and 3 to be released in Spring and Fall 2025.