Reputation: 1933
How can I get the Lorenz curve and the Gini coefficient with the pandas Python package? Similar posts on the Gini coefficient and Lorenz curve mostly concern NumPy or R.
Upvotes: 0
Views: 144
Reputation: 1933
Here is an example using one function to prepare the Lorenz curve and another to get the Gini coefficient. I use data from a Pareto II (also known as Lomax) distribution to get a suitably skewed distribution for a Lorenz curve.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt


    def lorenz_prep(df, x, y):
        # Note: DataFrame.map needs pandas >= 2.1; use applymap on older versions.
        df_lorenz_curve = (
            pd.concat([df, df.head(1).map(lambda _: 0)], ignore_index=True)  # Add the origin (0, 0) of the Lorenz curve as its own row.
            .loc[lambda df_: (df_[y] / df_[x]).fillna(0).sort_values().index]  # Sort rows by income per person.
            .assign(
                equal_line=lambda df_: df_[x].cumsum() / df_[x].sum(),  # Cumulative population shares.
                lorenz_curve=lambda df_: df_[y].cumsum() / df_[y].sum(),  # Cumulative income shares.
            )
            .set_index("equal_line", drop=False)
            .rename_axis(None)
        )
        return df_lorenz_curve


    def gini(df):
        """
        Gini coefficient from the output of lorenz_prep().

        The following section was consulted to create this function:
        https://de.wikipedia.org/wiki/Gini-Koeffizient#Beispiel
        """
        df_g = df.assign(
            pop_share=df["equal_line"].diff(),  # Population share of each row.
            income_share=df["lorenz_curve"].diff(),  # Income share of each row.
        )
        # Trapezoid rule: (lorenz_curve - income_share / 2) is the mean height of each segment.
        g = 1 - 2 * ((df_g["lorenz_curve"] - df_g["income_share"] / 2) * df_g["pop_share"]).sum()
        return g


    # Create an example dataframe.
    df = pd.DataFrame({"income": (np.random.default_rng(seed=42).pareto(a=1.2, size=200) + 1) * 1500,
                       "number_of_people": 1,
                       })

    # Prepare dataframe.
    df_res = df.pipe(lorenz_prep, x="number_of_people", y="income")

    # Plot Lorenz curve.
    df_res[["equal_line", "lorenz_curve"]].plot()
    plt.show()

    # Get Gini coefficient.
    print(df_res.pipe(gini))
Lorenz curve: [plot of equal_line and lorenz_curve]
Gini coefficient: 0.5471224899542815
The gini function was checked with this.
In the dataset used, each row contains the income of one person, so "number_of_people" is always 1. However, the gini function above can also process data where one row holds the combined income of several people (for example, for income ranges), e.g.:
    # 5 people together earn 2000, and so on.
    df = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                       "number_of_people": [5, 3, 2, 1],
                       })
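To make the grouped case concrete, here is a minimal self-contained sketch. The helper name `gini_grouped` is hypothetical and not part of the answer's code; it applies the same trapezoid idea as the gini function above in a single step, assuming `x` holds head counts and `y` holds each group's total income.

```python
import pandas as pd


def gini_grouped(df, x, y):
    """Hypothetical helper: trapezoid-rule Gini for grouped rows."""
    # Sort groups by income per person, poorest first.
    d = df.loc[(df[y] / df[x]).sort_values().index]
    pop_share = d[x] / d[x].sum()        # population share of each group
    lorenz = d[y].cumsum() / d[y].sum()  # cumulative income share
    # Lorenz value at the previous point; the curve starts at 0.
    lorenz_prev = lorenz.shift(fill_value=0)
    # Area under the Lorenz curve as a sum of trapezoids.
    area = ((lorenz + lorenz_prev) / 2 * pop_share).sum()
    return 1 - 2 * area


df_grouped = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                           "number_of_people": [5, 3, 2, 1]})
g_grouped = gini_grouped(df_grouped, x="number_of_people", y="income")
print(g_grouped)
```

For these four groups the result works out to 188/297, roughly 0.633.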
This is an alternative function to calculate the Gini coefficient, using a formula that is more common in the literature. However, it can only process datasets of the first kind described above (one income per entity).
    def gini_alternative(df, y):
        """
        Use the raw data for this function,
        do not use lorenz_prep() on the dataset before running this function.
        This function only works when the data has one income per entity (or similar).
        """
        dfx = df.sort_values(y)  # Sort incomes in ascending order.
        dfx.index = pd.RangeIndex(start=1, stop=dfx.index.size + 1)  # Ranks 1..n as index.
        return ((2 * dfx.index - dfx.index.size - 1) * dfx[y]).sum() / (dfx.index.size**2 * dfx[y].mean())


    # df must be the one-row-per-person dataframe here, not the grouped example.
    print(df.pipe(gini_alternative, y="income"))
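For one-income-per-row data, the rank-based formula and the trapezoid calculation over the Lorenz curve are algebraically the same quantity. The sketch below illustrates this on a tiny Series; the function names `gini_rank` and `gini_trapezoid` are hypothetical stand-ins written for this check, not the functions from the answer.

```python
import numpy as np
import pandas as pd


def gini_rank(income):
    """Rank-based Gini (same formula as gini_alternative, on a Series)."""
    x = income.sort_values().to_numpy()
    n = len(x)
    ranks = np.arange(1, n + 1)
    return ((2 * ranks - n - 1) * x).sum() / (n**2 * x.mean())


def gini_trapezoid(income):
    """Trapezoid-rule Gini over the Lorenz curve, one entity per row."""
    x = income.sort_values().reset_index(drop=True)
    lorenz = x.cumsum() / x.sum()
    lorenz_prev = lorenz.shift(fill_value=0)  # curve starts at 0
    # Each entity has population share 1 / n.
    return 1 - 2 * ((lorenz + lorenz_prev) / 2 / len(x)).sum()


income = pd.Series([1, 2, 3, 4])
g_rank = gini_rank(income)
g_trap = gini_trapezoid(income)
print(g_rank, g_trap)  # both approximately 0.25
```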
Mathematical notation [1, 2]:
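Written out to match gini_alternative() above (a reconstruction from the code, with incomes x_i sorted in ascending order, n the number of entities, and \bar{x} the mean income):

```latex
G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n^{2}\, \bar{x}}
```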
[1] https://mathworld.wolfram.com/GiniCoefficient.html
[2] http://dx.doi.org/10.2307/177185
Upvotes: 0