Reputation: 1933

Get Lorenz curve and Gini coefficient in pandas

How can I get the Lorenz curve and the Gini coefficient with the pandas Python package? Similar posts on the Gini coefficient and Lorenz curve mostly concern numpy or R.

Upvotes: 0

Answers (1)

mouwsy

Reputation: 1933

Here is an example that uses one function to prepare the Lorenz curve and another to compute the Gini coefficient. The sample data are drawn from a Pareto II (also known as Lomax) distribution, which gives a suitably skewed distribution for a Lorenz curve.

Computation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  

def lorenz_prep(df, x, y):
    # x: column with the number of people per row, y: column with their total income.
    # Note: DataFrame.map requires pandas >= 2.1 (use applymap on older versions).
    df_lorenz_curve = (pd.concat([df, df.head(1).map(lambda _: 0)], ignore_index=True)    # Add the origin (0, 0) of the Lorenz curve as its own row.
                       .loc[lambda df_: (df_[y] / df_[x]).fillna(0).sort_values().index]  # Sort rows by income per person (0/0 at the origin becomes 0).
                       .assign(equal_line=lambda df_: df_[x].cumsum() / df_[x].sum(),     # Cumulative share of people.
                               lorenz_curve=lambda df_: df_[y].cumsum() / df_[y].sum(),   # Cumulative share of income.
                               )
                       .set_index("equal_line", drop=False)
                       .rename_axis(None)
                       )
    return df_lorenz_curve


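def gini(df):
    """
    Compute the Gini coefficient from the output of lorenz_prep()
    by summing the trapezoids under the Lorenz curve.
    The following section was consulted to create this function:
    https://de.wikipedia.org/wiki/Gini-Koeffizient#Beispiel
    """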
    df_g = df.assign(pop_share=df["equal_line"].diff(),
                     income_share=df["lorenz_curve"].diff(),
                     )
    g = 1 - 2 * ((df_g["lorenz_curve"] - df_g["income_share"] / 2) * df_g["pop_share"]).sum()
    return g


# Create an example dataframe
df = pd.DataFrame({"income": (np.random.default_rng(seed=42).pareto(a=1.2, size=200) + 1) * 1500,
                   "number_of_people": 1,
                   })

# Prepare dataframe.
df_res = df.pipe(lorenz_prep, x="number_of_people", y="income")

# Plot Lorenz curve.
df_res[["equal_line", "lorenz_curve"]].plot()
plt.show()

# Get Gini coefficient.
print(df_res.pipe(gini))

Results

Lorenz curve:

Gini coefficient: 0.5471224899542815
gini function checked with this.

Note

In the dataset used above, each row holds the income of one person, so "number_of_people" is always 1. However, the gini function provided can also process data where one row holds the combined income of several people (for example, for income ranges), e.g.:

# 5 people together earn 2000, and so on.
df = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                   "number_of_people": [5, 3, 2, 1],
                   })
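
As a rough illustration of the grouped case, the sketch below condenses the same trapezoid computation into one self-contained helper and applies it to the dataframe above. The name `gini_grouped` is hypothetical and not part of the answer's code; it is only meant to show what the grouped input produces, assuming every row has at least one person.

```python
import pandas as pd


def gini_grouped(df, x, y):
    """Gini coefficient for rows holding income y shared by x people (sketch)."""
    # Sort rows by income per person, poorest group first.
    d = df.loc[(df[y] / df[x]).sort_values().index]
    pop_share = d[x] / d[x].sum()        # Width of each segment on the x-axis.
    lorenz = d[y].cumsum() / d[y].sum()  # Lorenz curve height at each segment end.
    income_share = d[y] / d[y].sum()     # Rise of the curve within each segment.
    # Trapezoid area under the Lorenz curve, then G = 1 - 2 * area.
    area = ((lorenz - income_share / 2) * pop_share).sum()
    return 1 - 2 * area


# 5 people together earn 2000, 3 people together earn 4000, and so on.
df_grouped = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                           "number_of_people": [5, 3, 2, 1]})

g = gini_grouped(df_grouped, x="number_of_people", y="income")
print(round(g, 6))  # 0.632997
```

Worked by hand, the trapezoids sum to 109/297, so the coefficient is 1 - 2 * (109/297) / 2 = 188/297, which matches the printed value.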

Alternative gini function

This is an alternative function for calculating the Gini coefficient, based on a formula that is more common in the literature. However, it can only process datasets of the first kind described above (one income per entity).

def gini_alternative(df, y):
    """
    Use the raw data for this function;
    do not run lorenz_prep() on the dataset before calling it.
    This function only works when the data has the form of one income per entity (or similar).
    """
    dfx = df.sort_values(y)
    dfx.index = pd.RangeIndex(start=1, stop=dfx.index.size + 1)
    return ((2 * dfx.index - dfx.index.size - 1) * dfx[y]).sum() / (dfx.index.size**2 * dfx[y].mean())

print(df.pipe(gini_alternative, y="income"))

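Mathematical notation¹,²:

$G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n^2 \bar{x}}$

where the $x_i$ are the incomes sorted in ascending order, $n$ is the number of entities, and $\bar{x}$ is the mean income.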
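
The rank-weighted sum that gini_alternative computes can be checked by hand on a tiny input. The plain-Python sketch below is an illustration only: both helper names are hypothetical, and the second helper uses the mean-absolute-difference form of the Gini coefficient as an independent cross-check rather than anything from the answer's code.

```python
def gini_sorted_formula(values):
    """G = sum((2i - n - 1) * x_i) / (n^2 * mean), x sorted ascending, i from 1."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n**2 * mean)


def gini_pairwise(values):
    """Mean absolute difference form: sum |x_i - x_j| / (2 * n^2 * mean)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(a - b) for a in values for b in values) / (2 * n**2 * mean)


incomes = [1, 2, 3, 4]
print(gini_sorted_formula(incomes))  # 0.25
print(gini_pairwise(incomes))        # 0.25
```

For the incomes 1, 2, 3, 4 the weights (2i - n - 1) are -3, -1, 1, 3, so the numerator is -3 - 2 + 3 + 12 = 10 and the denominator is 16 * 2.5 = 40, giving 0.25.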

¹https://mathworld.wolfram.com/GiniCoefficient.html
²http://dx.doi.org/10.2307/177185