Reputation: 101
My investigation started from the very common saying, "80% of the wealth is in the hands of 20% of the population." I wanted to verify whether this is true. After some online research and calculations, I found that, if wealth follows a Pareto distribution, this statement implies a shape parameter of $\alpha = \log_4 5 \approx 1.16$, and the CDF of this Pareto distribution would be:
$$ F_X(x) = \begin{cases} 1 - \left( \frac{x_m}{x} \right)^{\log_4 5} & \text{for } x \geq x_m, \\ 0 & \text{for } x < x_m. \end{cases} $$
I used billionaire data from Kaggle (https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset). The dataset contains various columns, but the only one I focused on is the 'finalWorth' column. I plotted the Pareto CDF with $\alpha = \log_4 5$ and the empirical distribution function (ECDF) of the 'finalWorth' column on the same axes. The x-axis represents wealth, and the y-axis represents cumulative probability. Visually, the fit looks exceptional!
(Figure 1: ECDF of 'finalWorth' plotted against the Pareto CDF.)
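As a quick sanity check on the exponent: for a Pareto distribution with $\alpha > 1$, the richest fraction $p$ of the population holds a share $p^{1 - 1/\alpha}$ of the total wealth (this follows from the Pareto Lorenz curve). The sketch below plugs in $\alpha = \log_4 5$ and $p = 0.2$; the helper name `top_share` is my own and not part of any library.

```python
import math

# Shape parameter implied by the 80/20 rule: log base 4 of 5, about 1.161.
alpha = math.log(5) / math.log(4)

def top_share(p, alpha):
    """Share of total wealth held by the richest fraction p
    under a Pareto law with shape alpha > 1 (hypothetical helper)."""
    return p ** (1 - 1 / alpha)

share = top_share(0.2, alpha)
print(f"alpha = {alpha:.4f}, share held by the top 20% = {share:.4f}")
```

The share comes out at 0.8 up to floating-point error, which is the 80/20 statement.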
However, I don't know how to show with high confidence that the data indeed follow a Pareto distribution. The Kolmogorov-Smirnov test seems to be the common way to go. If the p-value is small, we can reject the null hypothesis (that the data come from the specified distribution) and conclude that they come from a different one. Conversely, if the data really do come from the specified distribution, a small p-value should be rare. So I am expecting a p-value that is not small here, although a large p-value would not prove that the data are from this Pareto distribution.
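The expectation above can be checked by simulation. The sketch below draws samples that really are Pareto with $\alpha = \log_4 5$, runs the same kind of `kstest` call on each, and counts how often the p-value falls below 0.05. The seed, sample size, number of repetitions, and the value of `scale` are arbitrary choices of mine, not taken from the dataset.

```python
import numpy as np
from scipy.stats import kstest, pareto

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
alpha = np.log(5) / np.log(4)
scale = 1000.0                   # hypothetical minimum wealth x_m
n, reps = 500, 200               # hypothetical sample size and repetitions

p_values = np.empty(reps)
for i in range(reps):
    # Draw a sample from the exact distribution being tested.
    sample = pareto.rvs(alpha, scale=scale, size=n, random_state=rng)
    # args = (shape, loc, scale), the same ordering as in my script.
    p_values[i] = kstest(sample, 'pareto', args=(alpha, 0, scale)).pvalue

frac_rejected = np.mean(p_values < 0.05)
print(f"fraction of p-values below 0.05: {frac_rejected:.3f}")
```

When the null hypothesis holds and the parameters are fixed in advance, the p-values are spread roughly uniformly over [0, 1], so only about 5% of them should fall below 0.05.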
After conducting the Kolmogorov-Smirnov test with Python using the scipy.stats library, the test returned an extremely small p-value of $3.268 \times 10^{-53}$, indicating the data is not from the Pareto distribution! This perplexed me for days. I still couldn't figure out why, but there could be 3 possible explanations:
Any help would be deeply appreciated!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest, pareto
# Load the data
billionaires_data = pd.read_csv("D:/Self Study/Math/Statistic/Billionaires Statistics Dataset.csv")
# Create the ECDF
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n+1) / n
    return x, y
# ECDF for 'finalWorth'
final_worth = billionaires_data['finalWorth']
x_ecdf, y_ecdf = ecdf(final_worth)
print(len(final_worth))
# Calculate alpha and scale for Pareto distribution
alpha = np.log(5) / np.log(4)
scale = np.min(final_worth[final_worth > 0])
# Sequence of wealth values
wealth_values = np.linspace(scale, final_worth.max(), 1000)
# Pareto CDF
pareto_cdf = pareto.cdf(wealth_values, alpha, scale=scale)
# Plot ECDF and Pareto CDF
plt.figure(figsize=(10, 6))
plt.plot(x_ecdf, y_ecdf, label='ECDF', color='blue')
plt.plot(wealth_values, pareto_cdf, label='Pareto CDF', color='red')
plt.title('ECDF of Final Worth vs. Pareto CDF')
plt.xlabel('Wealth (Final Worth)')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.show()
# Perform KS test
ks_statistic, p_value = kstest(final_worth, 'pareto', args=(alpha, 0, scale))
# Print KS test result
print(f"KS Statistic: {ks_statistic}, p-value: {p_value}")
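One complementary check, which is not part of the script above, would be to estimate the shape parameter from the data and compare it with $\log_4 5$. For a known minimum $x_m$, the maximum-likelihood estimate is $\hat{\alpha} = n / \sum_i \ln(x_i / x_m)$. The sketch below is only an illustration on synthetic data; `pareto_alpha_mle` is a hypothetical helper name, and the seed, sample size, and `x_m` are arbitrary.

```python
import numpy as np

def pareto_alpha_mle(data, x_m):
    """Maximum-likelihood estimate of the Pareto shape for a known minimum x_m
    (hypothetical helper, not a SciPy function)."""
    data = np.asarray(data, dtype=float)
    tail = data[data >= x_m]          # keep only values in the Pareto support
    return len(tail) / np.sum(np.log(tail / x_m))

rng = np.random.default_rng(1)        # fixed seed, chosen arbitrarily
alpha_true = np.log(5) / np.log(4)
x_m = 1000.0                          # hypothetical minimum wealth

# Inverse-CDF sampling: if U is uniform on (0, 1], then
# x_m * U**(-1/alpha) is Pareto distributed with shape alpha and minimum x_m.
u = 1.0 - rng.random(5000)
synthetic = x_m * u ** (-1 / alpha_true)

alpha_hat = pareto_alpha_mle(synthetic, x_m)
print(f"true alpha = {alpha_true:.4f}, estimated alpha = {alpha_hat:.4f}")
```

On the real data, the same function could be called as `pareto_alpha_mle(final_worth, scale)`. An estimate far from $\log_4 5$ would suggest that the shape parameter, rather than the Pareto form itself, is what the test is rejecting.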
Upvotes: 0
Views: 118