Distribution Fitting
Distribution fitting is the process of selecting and parameterizing a probability distribution that best describes observed data. Analysts accomplish this by comparing empirical data against theoretical distributions, adjusting parameters to minimize discrepancies. Methods often include statistical tests (e.g., Kolmogorov-Smirnov, Anderson-Darling), visual assessments (e.g., histograms, Q-Q plots), and numerical criteria (e.g., AIC, BIC, or log-likelihood).
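As a minimal sketch of this comparison step (using Distributions.jl directly, independent of the MCHammer functions documented below), candidate distributions can be fit by maximum likelihood and ranked by AIC:

```julia
using Distributions, Random

Random.seed!(1)
data = rand(LogNormal(0, 1), 1000)   # synthetic "observed" data

# Fit each candidate by maximum likelihood and score it with AIC = 2k - 2*logL
for D in (Normal, LogNormal, Exponential)
    d = fit_mle(D, data)
    k = length(params(d))            # number of fitted parameters
    aic = 2k - 2loglikelihood(d, data)
    println(rpad(string(D), 14), "AIC = ", round(aic, digits = 1))
end
```

Lower AIC indicates a better trade-off between fit quality and parameter count; here the LogNormal candidate should score best, since it generated the data.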
Fitting Correctly
Practical Applications for Analysts
Distribution fitting is foundational for analysts across various fields:
- Risk Management & Finance: Analysts fit distributions to asset returns or operational loss data to estimate Value at Risk (VaR), conduct stress testing, and manage financial risks.
- Reliability Engineering: By fitting failure-time data (such as machine lifetimes) to distributions like Weibull or Lognormal, analysts predict product lifespan and develop appropriate maintenance schedules.
- Quality Control: Analysts use fitted distributions to monitor manufacturing processes, identifying deviations from expected behaviors to maintain quality standards.
- Environmental Modeling: Distribution fitting helps predict extreme weather events, flood frequency, and environmental risks by modeling historical weather data or natural phenomena.
- Decision Analysis & Simulation: Analysts model uncertainty in decision-making contexts by fitting distributions to historical data, enabling realistic simulations (e.g., Monte Carlo simulations) for strategic planning.
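A common pattern in the simulation setting is to fit historical data once and then drive a Monte Carlo simulation from the fitted distribution. A sketch with Distributions.jl (the Gamma demand model and sample sizes are purely illustrative):

```julia
using Distributions, Random, Statistics

Random.seed!(7)
historical = rand(Gamma(2, 3), 500)   # stand-in for observed demand history

# Fit once, then simulate many future scenarios from the fitted model
d = fit_mle(Gamma, historical)
trials = rand(d, 10_000)

# A planning quantile from the simulation, e.g. the 95th percentile of demand
println("Simulated P95: ", round(quantile(trials, 0.95), digits = 2))
```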
Caveats of Fitting Distributions Empirically
Despite its usefulness, empirical distribution fitting has important limitations:
- Data Quantity and Quality: Small or noisy datasets may lead to unstable parameter estimates, resulting in unreliable conclusions.
- Outliers and Extreme Values: Extreme observations can significantly influence the fitted distribution parameters, potentially distorting insights unless handled appropriately.
- Overfitting Risk: Analysts might select overly complex distributions that closely fit historical data but generalize poorly to new observations, limiting predictive power.
- Misinterpretation of Statistical Tests: Statistical tests like Kolmogorov-Smirnov or Anderson-Darling can indicate a good fit even when practical considerations (such as data context) suggest otherwise, or vice versa.
- Subjective Judgment: Empirical fitting often requires subjective judgment when choosing between multiple similarly good fits, potentially introducing bias or inconsistency.
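The data-quantity caveat is easy to demonstrate: refitting the same model on repeated small samples yields widely varying parameter estimates. A sketch with Distributions.jl (the n = 20 sample size and replication count are arbitrary):

```julia
using Distributions, Random

Random.seed!(3)
# Refit a Normal on 1000 independent samples of only n = 20 points each
sigmas = [params(fit_mle(Normal, rand(Normal(0, 1), 20)))[2] for _ in 1:1000]

# The estimated sigma swings widely even though the true value is always 1.0
println("sigma-hat spread at n = 20: ", round(minimum(sigmas), digits = 2),
        " to ", round(maximum(sigmas), digits = 2))
```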
Importance of Selecting Appropriate Distributions Beforehand
Understanding appropriate theoretical distributions prior to analysis is crucial:
- Domain Knowledge: Analysts familiar with the underlying process or theoretical context of the data (e.g., finance, engineering, biology) can immediately focus on distributions that are known to realistically model the phenomena.
- Statistical Properties: Different distributions have distinct properties such as skewness, kurtosis, tail heaviness, and bounds. Analysts must select distributions that inherently align with these features of the data.
- Avoiding Mis-Specification: Selecting inappropriate distributions can lead to misinterpretation of risk, poor resource allocation, or misguided policy decisions. Knowledgeable selection beforehand reduces the likelihood of such errors.
- Interpreting Results: Using contextually appropriate distributions enhances the interpretability of model parameters, helping stakeholders more clearly understand risks, probabilities, and implications of the analysis.
When used carefully, distribution fitting is a valuable analytical tool enabling informed decision-making under uncertainty. However, analysts must carefully consider data quality, model complexity, and theoretical context to ensure robust, meaningful insights from their fitted distributions.
Functions for analyzing and fitting distributions
- Visualizing Fits: viz_fit
- Descriptive Fit Statistics: fit_stats
- Automatic Distribution Fitting: autofit_dist
Visually analyzing fits
MCHammer.viz_fit — Function

viz_fit(SampleData; DistFit=[], cumulative=false)

Visualizes sample data against fitted probability density functions (PDFs) and cumulative distribution functions (CDFs).

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Array of distribution types to fit. Defaults to [Normal, LogNormal, Uniform] if not provided.
- cumulative (optional): If true, returns the results in cumulative (CDF) form. Default is probability density (PDF) form.

Returns
A plot object with the density of SampleData overlaid by the fitted PDFs / CDFs.
using MCHammer, Distributions, Random

# Generate sample data from a Normal distribution
Random.seed!(42)
sample_data = randn(1000)
fit_result = viz_fit(sample_data)
Fit vs Data Stats
MCHammer.fit_stats — Function

fit_stats(SampleData; DistFit=[], pvals=true, Increment=0.1)

Calculates descriptive statistics for the sample data and for each fitted distribution in DistFit.

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Pre-selected array of distribution types to fit. Defaults to [Normal, LogNormal, Uniform] if not provided.
- pvals (optional): Displays percentiles for the sample data and the fits.
- Increment (optional): Increment for the percentiles (e.g., 0.1 for 0%, 10%, …, 100%). Default is 0.1.

Returns
A DataFrame (transposed) containing descriptive statistics (mean, median, mode, standard deviation, variance, skewness, kurtosis, etc.) for the sample data and for each fitted distribution.

When pvals = true, percentile rows are added to the stats table:
- Percentile: The percentile (in %).
- SampleData: The corresponding quantile of the sample data.
- One column per fitted distribution containing the theoretical quantiles.
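The percentile rows can be reproduced directly: the SampleData column is the empirical quantile of the data, and each distribution column is the theoretical quantile of the fitted model. A sketch with Distributions.jl (mirroring the default 0.1 increment, interior percentiles only):

```julia
using Distributions, Random, Statistics

Random.seed!(42)
data = rand(LogNormal(0, 1), 1000)
d = fit_mle(LogNormal, data)

# Empirical quantile of the data vs. theoretical quantile of the fit
for p in 0.1:0.1:0.9
    println(round(Int, 100p), "%  sample = ", round(quantile(data, p), digits = 3),
            "   fitted = ", round(quantile(d, p), digits = 3))
end
```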
using MCHammer, Distributions, Random

# Generate sample data from a LogNormal distribution
Random.seed!(42)
sample_data = rand(LogNormal(0, 1), 1000)
fits = fit_stats(sample_data)
show(fits, allrows=true, allcols=true)
22×5 DataFrame
Row │ Name Sample Data Distributions.Normal Distributions.LogNormal Distributions.Uniform
│ Any Any Any Any Any
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Mean 1.52063 1.52063 1.53291 11.0431
2 │ Median 0.956697 1.52063 0.941657 11.0431
3 │ Mode 0.695338 1.52063 0.355341 11.0431
4 │ Standard_Deviation 1.85148 1.85055 1.96907 6.34239
5 │ Variance 3.42797 3.42454 3.87723 40.2259
6 │ Skewness 4.24973 0.0 5.97307 0.0
7 │ Kurtosis 30.1931 0.0 101.604 -1.2
8 │ Coeff_Variation 1.21757 1.21696 1.28453 0.574329
9 │ Minimum 0.0577953 -Inf 0.0 0.0577953
10 │ Maximum 22.0285 Inf Inf 22.0285
11 │ MeanStdError 0.0585489 NaN NaN NaN
12 │ 0.0 0.0577953 -Inf 0.0 0.0577953
13 │ 10.0 0.255968 -0.850948 0.265733 2.25486
14 │ 20.0 0.407048 -0.0368346 0.410261 4.45193
15 │ 30.0 0.569005 0.550199 0.56113 6.649
16 │ 40.0 0.755634 1.0518 0.733287 8.84606
17 │ 50.0 0.956697 1.52063 0.941657 11.0431
18 │ 60.0 1.20084 1.98946 1.20924 13.2402
19 │ 70.0 1.62215 2.49106 1.58024 15.4373
20 │ 80.0 2.22462 3.07809 2.16135 17.6343
21 │ 90.0 3.37068 3.89221 3.33687 19.8314
22 │ 100.0 22.0285 Inf Inf 22.0285
Automatic Fitting
MCHammer.autofit_dist — Function

autofit_dist(SampleData; DistFit=nothing, FitLib=nothing, sort="AIC", verbose=false)

Fits a list of candidate distributions to the provided sample data, computes goodness-of-fit statistics, and returns a DataFrame summarizing the results.

Arguments
- SampleData: Array of sample data.
- DistFit (optional): Array of distribution types to attempt (e.g., [Normal, Gamma, Exponential]). Each element must be a type that is a subtype of Distribution. If provided, this overrides FitLib.
- FitLib (optional): Symbol indicating a predefined library of distributions to use. Valid options are :all, :continuous, or :discrete. Defaults to :continuous if neither DistFit nor FitLib is provided.
- sort (optional): String indicating the criterion to sort the results. Options include "ad", "ks", "ll", or "AIC". Defaults to "AIC".
- verbose (optional): Boolean flag indicating whether to print warnings for distributions that fail to fit. Defaults to false.

Returns
A DataFrame with the following columns:
- DistName: The name of the distribution type.
- ADTest: Anderson-Darling test statistic.
- KSTest: Kolmogorov-Smirnov test statistic.
- AIC: Akaike Information Criterion.
- AICc: Corrected AIC.
- LogLikelihood: Log-likelihood of the fit.
- FitParams: The parameters of the fitted distribution.
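The AIC and AICc columns follow the standard definitions. A sketch of the relationship using Distributions.jl (this illustrates the formulas, not autofit_dist's internals):

```julia
using Distributions, Random

# AICc adds a small-sample penalty to AIC: AICc = AIC + 2k(k+1)/(n-k-1)
function aic_aicc(d::Distribution, data)
    k = length(params(d))                 # number of fitted parameters
    n = length(data)
    aic = 2k - 2loglikelihood(d, data)
    return aic, aic + 2k * (k + 1) / (n - k - 1)
end

Random.seed!(42)
data = rand(LogNormal(0, 1), 1000)
aic, aicc = aic_aicc(fit_mle(LogNormal, data), data)
println("AIC = ", round(aic, digits = 2), ", AICc = ", round(aicc, digits = 2))
```

With n = 1000 observations and only two parameters, the correction is tiny; it matters most for small samples or many-parameter candidates.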
using MCHammer, Distributions, Random

Random.seed!(42)
sample_data = rand(LogNormal(0, 1), 1000)
fits = autofit_dist(sample_data)
show(fits, allrows=true, allcols=true)
[ Info: No `DistFit` or `FitLib` provided, defaulting to FitLib = :all.
13×7 DataFrame
Row │ DistName ADTest KSTest AIC AICc LogLikelihood FitParams
│ String Float64 Float64 Float64 Float64 Float64 Any
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Distributions.LogNormal 0.214931 0.0158499 2694.89 2694.9 -1345.44 (-0.0601137, 0.9872)
2 │ Distributions.InverseGaussian 3.71398 0.0561168 2724.91 2724.92 -1360.45 (1.52063, 0.938592)
3 │ Distributions.Gamma 9.73695 0.0766574 2825.48 2825.49 -1410.74 (1.18215, 1.28633)
4 │ Distributions.Exponential 10.8273 0.0653088 2840.25 2840.25 -1419.12 (1.52063,)
5 │ Distributions.Weibull 10.324 0.0664369 2840.86 2840.87 -1418.43 (1.02719, 1.54015)
6 │ Distributions.Cauchy 39.889 0.211156 3352.57 3352.59 -1674.29 (0.956697, 0.702138)
7 │ Distributions.Laplace 48.0785 0.207575 3431.8 3431.81 -1713.9 (0.956697, 1.02097)
8 │ Distributions.Pareto Inf 0.323936 3936.38 3936.39 -1966.19 (0.358329, 0.0577953)
9 │ Distributions.Normal 86.8675 0.214622 4072.84 4072.86 -2034.42 (1.52063, 1.85055)
10 │ Distributions.Rayleigh 418.213 0.380772 4229.75 4229.76 -2113.88 (1.69364,)
11 │ Distributions.Uniform Inf 0.753007 6183.42 6183.43 -3089.71 (0.0577953, 22.0285)
12 │ Distributions.Categorical{P} whe… Inf 0.001 13819.5 13819.5 -6907.76 ([0.0577953, 0.0640543, 0.064986…
13 │ Distributions.DiscreteNonParamet… Inf 0.001 13819.5 13819.5 -6907.76 ([0.0577953, 0.0640543, 0.064986…
Sources & References
- Eric Torkia, Decision Superhero Vol. 2, Chapter 3: "Superpower: Modeling the Behaviors of Inputs", Technics Publications, 2025.
- Available on Amazon: https://a.co/d/4YlJFzY. Volumes 2 and 3 to be released in Spring and Fall 2025.