Danny Wen

Reputation: 101

Kolmogorov-Smirnov test fails with Pareto distribution

My investigation started from the very common saying, "80% of the wealth is in the hands of 20% of the population." I wanted to verify whether this is true. After some online research and calculations, I found that this statement implies that wealth follows a Pareto distribution with $\alpha = \log_4 5$, and the CDF of this Pareto distribution would be:

$$ F_X(x) = \begin{cases} 1 - \left( \frac{x_m}{x} \right)^{\log_4 5} & \text{for } x \geq x_m, \\ 0 & \text{for } x < x_m. \end{cases} $$

I used billionaire data from Kaggle (https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset). The dataset contains various columns, but the only one I focused on is the 'finalWorth' column. I plotted the Pareto CDF with $\alpha = \log_4 5$ and the empirical distribution function (ECDF) of the 'finalWorth' column on the same axes. The x-axis represents wealth, and the y-axis represents cumulative probability. To the eye, the fit is exceptional!

[Figure 1: ECDF of 'finalWorth' vs. the Pareto CDF]
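
As a quick check of the $\alpha = \log_4 5$ claim, here is a minimal sketch. It assumes the standard Lorenz-curve result for a Pareto distribution with $\alpha > 1$, namely that the richest fraction $p$ of the population holds a share $p^{1 - 1/\alpha}$ of total wealth. The helper name `top_share` is made up for this illustration.

```python
import math

def top_share(alpha, top_fraction):
    """Share of total wealth held by the richest `top_fraction`
    under a Pareto(alpha) distribution; requires alpha > 1."""
    return top_fraction ** (1.0 - 1.0 / alpha)

# alpha = log_4(5), written with natural logs
alpha = math.log(5) / math.log(4)

print(f"alpha = {alpha:.4f}")
print(f"share held by top 20% = {top_share(alpha, 0.20):.4f}")
```

With this $\alpha$, the top 20% should come out holding 80% of the wealth, which is the 80/20 statement.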

However, I don't know how to show with high confidence that the data indeed follow a Pareto distribution. The Kolmogorov-Smirnov test seems to be the common way to go. If the p-value is small, we can reject the null hypothesis (that the data come from the hypothesized distribution) and claim they come from a different distribution. The contrapositive, "if the data come from the hypothesized distribution, the p-value must be large," should then be true. So I am expecting a large p-value here, although a large p-value would not by itself prove that the data come from this Pareto distribution.
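
One way to sanity-check this expectation is to simulate samples that really do come from the hypothesized Pareto distribution and look at the p-values `kstest` returns. The sketch below is only an illustration: the scale `1000.0`, the sample size, and the number of replicates are arbitrary assumptions, not values taken from the dataset. For a fully specified continuous null, theory says the p-values should be roughly uniform on [0, 1], so "large" really means "not systematically small".

```python
import numpy as np
from scipy.stats import kstest, pareto

alpha = np.log(5) / np.log(4)
scale = 1000.0                      # hypothetical x_m, chosen for illustration
rng = np.random.default_rng(0)      # fixed seed for reproducibility

def simulate_null_pvalues(n_samples=200, sample_size=500):
    """KS p-values for samples drawn from the null Pareto itself."""
    pvals = np.empty(n_samples)
    for k in range(n_samples):
        sample = pareto.rvs(alpha, scale=scale, size=sample_size,
                            random_state=rng)
        # args are (shape b, loc, scale) for scipy's pareto
        pvals[k] = kstest(sample, 'pareto', args=(alpha, 0, scale)).pvalue
    return pvals

p_values = simulate_null_pvalues()
print("p-value quantiles (10%, 50%, 90%):",
      np.quantile(p_values, [0.1, 0.5, 0.9]))
print("fraction below 0.05:", (p_values < 0.05).mean())
```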

Running the Kolmogorov-Smirnov test in Python with the scipy.stats library returned an extremely small p-value of $3.268 \times 10^{-53}$, indicating that the data are not from this Pareto distribution! This has perplexed me for days. I still can't figure out why, but I can think of three possible explanations:

  1. My code is wrong somewhere; the error is caused by data handling. The code is attached below.
  2. The wealth distribution indeed does not follow the Pareto distribution, which I think is highly unlikely.
  3. The p-value of the Kolmogorov-Smirnov test, calculated by the asymptotic formula $p \approx 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2i^2D^2n}$, somehow approaches zero with the Pareto distribution, maybe due to infinite variance?
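
To probe the third explanation, here is a small sketch that evaluates the asymptotic Kolmogorov series directly, assuming the usual survival-function form $p \approx 2 \sum_{i \ge 1} (-1)^{i-1} e^{-2 i^2 D^2 n}$, and compares it with `scipy.stats.kstwobign.sf`. The function name `ks_asymptotic_pvalue` and the values of `d` and `n` are made up for illustration. Note that the formula takes only $D$ and $n$ as inputs; the hypothesized distribution enters only through $D$.

```python
import numpy as np
from scipy.stats import kstwobign

def ks_asymptotic_pvalue(d, n, terms=100):
    """Truncated asymptotic series for the two-sided KS p-value."""
    i = np.arange(1, terms + 1)
    series = (-1.0) ** (i - 1) * np.exp(-2.0 * i**2 * d**2 * n)
    return float(2.0 * np.sum(series))

# Hypothetical statistic and sample size, for illustration only
d, n = 0.05, 1000
p_series = ks_asymptotic_pvalue(d, n)
p_scipy = kstwobign.sf(np.sqrt(n) * d)   # limiting Kolmogorov distribution

print(f"series: {p_series:.6g}, scipy kstwobign: {p_scipy:.6g}")
```

Note that `kstest` may use an exact rather than asymptotic method depending on its `method` argument and the sample size, so its p-value need not match this series digit for digit.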

Any help would be deeply appreciated!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest, pareto

# Load the data
billionaires_data = pd.read_csv("D:/Self Study/Math/Statistic/Billionaires Statistics Dataset.csv")

# Create the ECDF
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n+1) / n
    return x, y

# ECDF for 'finalWorth'
final_worth = billionaires_data['finalWorth']
x_ecdf, y_ecdf = ecdf(final_worth)
print(len(final_worth))

# Calculate alpha and scale for Pareto distribution
alpha = np.log(5) / np.log(4)
scale = np.min(final_worth[final_worth > 0])


# Sequence of wealth values
wealth_values = np.linspace(scale, final_worth.max(), 1000)

# Pareto CDF
pareto_cdf = pareto.cdf(wealth_values, alpha, scale=scale)

# Plot ECDF and Pareto CDF
plt.figure(figsize=(10, 6))
plt.plot(x_ecdf, y_ecdf, label='ECDF', color='blue')
plt.plot(wealth_values, pareto_cdf, label='Pareto CDF', color='red')
plt.title('ECDF of Final Worth vs. Pareto CDF')
plt.xlabel('Wealth (Final Worth)')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.show()

# Perform KS test
ks_statistic, p_value = kstest(final_worth, 'pareto', args=(alpha, 0, scale))

# Print KS test result
print(f"KS Statistic: {ks_statistic}, p-value: {p_value}")
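
To help rule the first explanation in or out, it may be useful to see where the largest gap between the ECDF and the hypothesized CDF occurs. The sketch below computes the KS statistic by hand and reports the location of the maximum deviation. It uses synthetic Pareto data with an assumed scale of `1000.0` so that it runs without the CSV file; the helper name `ks_gap` is hypothetical. On the real data one would pass `final_worth` and the same `alpha` and `scale` as in the script above.

```python
import numpy as np
from scipy.stats import kstest, pareto

def ks_gap(sample, cdf):
    """Return (D, x) where D is the two-sided KS statistic and
    x is the sample point at which the largest gap occurs."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f = cdf(x)
    d_plus = np.arange(1, n + 1) / n - f    # ECDF above the model CDF
    d_minus = f - np.arange(0, n) / n       # ECDF below the model CDF
    i_plus, i_minus = np.argmax(d_plus), np.argmax(d_minus)
    if d_plus[i_plus] >= d_minus[i_minus]:
        return float(d_plus[i_plus]), float(x[i_plus])
    return float(d_minus[i_minus]), float(x[i_minus])

# Synthetic stand-in for the 'finalWorth' column (assumption)
alpha = np.log(5) / np.log(4)
scale = 1000.0
sample = pareto.rvs(alpha, scale=scale, size=500, random_state=0)

d_stat, x_at_max = ks_gap(sample, lambda v: pareto.cdf(v, alpha, scale=scale))
reference = kstest(sample, 'pareto', args=(alpha, 0, scale)).statistic
print(f"D = {d_stat:.4f} at x = {x_at_max:.1f} (scipy D = {reference:.4f})")
```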

Upvotes: 0

Views: 118

Answers (0)
