How to find area between density plots in python?

Question

I was reading a blog about feature selection based on the density curves of the features. The blog uses R, which I am not familiar with.

Blog posts:
- https://myabakhova.blogspot.com/2016/02/computing-ratio-of-areas.html
- https://www.datasciencecentral.com/profiles/blogs/choosing-features-for-random-forests-algorithm

The blog says that if the density curves of two features are significantly different (see the text below the equation in the blog, which gives a threshold of > 0.75), then we can discard one of the features.

Now, I am familiar with how to plot density curves, but I am not sure how to compute the intersection area. Any help with finding the intersection area is greatly appreciated.

Here is my attempt:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

df = sns.load_dataset('iris').drop('species',axis=1)

# normalize data
x = df.to_numpy()
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

# density plots
x1 = df[0]
x2 = df[1]
sns.distplot(x1)
sns.distplot(x2)

Now, I don't know how to find the area under each of the two curves or the area of their intersection.

Questions

How to find the area under each curve?
How to find the area of overlapping section?
Do we need to scale features (sepal or petal length to 0 to 1)?

My output

For reference, the blog density curves look like this

Stef · Accepted Answer

How to find the area under each curve?

By numerical integration of the kde curve, e.g. using the trapezoidal rule (np.trapz):

area1 = np.trapz(ax.lines[0].get_ydata(), ax.lines[0].get_xdata())

(should be 1.0 by definition)
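
As a quick sanity check of the trapezoidal rule on a curve whose area is known, here is a minimal sketch that integrates a standard normal density. It is not part of the original answer: it assumes SciPy is available and uses scipy.integrate.trapezoid, which performs the same integration as np.trapz. It also shows why the computed area drops below 1 when the x-grid cuts off part of the curve.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

# Wide grid: covers practically all of the probability mass.
x_wide = np.linspace(-6, 6, 1001)
area_full = trapezoid(norm.pdf(x_wide), x_wide)

# Narrow grid: the tails outside [-1, 1] are cut off,
# so the integral is noticeably smaller than 1.
x_narrow = np.linspace(-1, 1, 1001)
area_clipped = trapezoid(norm.pdf(x_narrow), x_narrow)

print(area_full, area_clipped)
```

The first value is very close to 1, the second is roughly the mass within one standard deviation. The same effect explains why areas taken from a plotted kde line are usually slightly below 1.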

How to find the area of overlapping section?

By numerical integration of the minimum of the two kde curves:

ymin = np.minimum(ax.lines[0].get_ydata(), ax.lines[1].get_ydata())
area_overlap = np.trapz(ymin, ax.lines[0].get_xdata())

Do we need to scale features (sepal or petal length to 0 to 1)?

Yes, both features must be scaled to the same range (not necessarily 0 to 1), otherwise step #2 wouldn't work.

The x-ranges of the two kde curves must also be identical for step #2, because the minimum is taken point by point. Therefore we need to set the interval explicitly with the clip keyword of the kdeplot function (passed here through kde_kws).

This is the whole program:

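# x1, x2 and the imports are taken from the code in the question

# use the same x-interval for both kde curves
clip = {'clip': (-.2, 1.2)}
sns.distplot(x1, kde_kws=clip)
ax = sns.distplot(x2, kde_kws=clip)

# area under each kde curve (ax.lines[0] belongs to x1, ax.lines[1] to x2)
area1 = np.trapz(ax.lines[0].get_ydata(), ax.lines[0].get_xdata())
area2 = np.trapz(ax.lines[1].get_ydata(), ax.lines[1].get_xdata())

# overlap: integrate the pointwise minimum of the two curves
ymin = np.minimum(ax.lines[0].get_ydata(), ax.lines[1].get_ydata())
area_overlap = np.trapz(ymin, ax.lines[0].get_xdata())

print(area1, area2, area_overlap)
#0.9997488977867803 0.9999803817881264 0.8338245964155915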
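
The same idea can also be sketched without reading the curves back from the plot. The version below is an assumption-laden alternative, not the method used above: it fits the densities with scipy.stats.gaussian_kde (default bandwidth, which may differ from what seaborn draws), evaluates both on one shared grid, and integrates the pointwise minimum. The helper name kde_overlap, the pad and num parameters, and the synthetic data are made up for illustration.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde


def kde_overlap(a, b, pad=0.2, num=512):
    """Return (area_a, area_b, overlap) for the KDEs of two 1-D samples.

    Hypothetical helper: both densities are evaluated on one shared grid
    that extends `pad` times the combined data range beyond the extremes.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    span = hi - lo
    grid = np.linspace(lo - pad * span, hi + pad * span, num)

    dens_a = gaussian_kde(a)(grid)  # density of a on the shared grid
    dens_b = gaussian_kde(b)(grid)  # density of b on the shared grid

    area_a = trapezoid(dens_a, grid)
    area_b = trapezoid(dens_b, grid)
    overlap = trapezoid(np.minimum(dens_a, dens_b), grid)
    return area_a, area_b, overlap


# Synthetic stand-ins for two scaled feature columns (not the iris data).
rng = np.random.default_rng(0)
x1 = rng.normal(0.4, 0.10, 150)
x2 = rng.normal(0.5, 0.15, 150)

area1, area2, area_overlap = kde_overlap(x1, x2)
print(area1, area2, area_overlap)
```

With the question's data this would be called as kde_overlap(df[0], df[1]). Because the grid is shared by construction, no clip setting is needed, but the numbers will not match the seaborn-based result exactly if the bandwidths differ.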
