Reputation: 17164
I was reading a blog post about feature selection based on the density curves of the features. The post uses R, which I am not familiar with.
Blog:
- https://myabakhova.blogspot.com/2016/02/computing-ratio-of-areas.html
- https://www.datasciencecentral.com/profiles/blogs/choosing-features-for-random-forests-algorithm
The blog says if the density curves of two features are significantly different (look below the equation, which says > 0.75), then we can discard one of the features.
I am familiar with how to plot density curves, but I am not sure how to get the intersection area. Any help with finding the intersection area is greatly appreciated.
Here is my attempt:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
df = sns.load_dataset('iris').drop('species',axis=1)
# normalize data
x = df.to_numpy()
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
# density plots
x1 = df[0]
x2 = df[1]
sns.distplot(x1)
sns.distplot(x2)
Now, I don't know how to find the area under each of the two curves, or the area of their intersection.
Upvotes: 0
Views: 1037
Reputation: 30609
- How to find the area under each curve?
By numerical integration of the KDE curve, e.g. using the trapezoidal rule (ax is the Axes object returned by the plotting call in the full program below):
area1 = np.trapz(ax.lines[0].get_ydata(), ax.lines[0].get_xdata())
(This should be approximately 1.0, since a density integrates to 1 by definition.)
- How to find the area of overlapping section?
By numerical integration of the minimum of the two kde curves:
ymin = np.minimum(ax.lines[0].get_ydata(), ax.lines[1].get_ydata())
area_overlap = np.trapz(ymin, ax.lines[0].get_xdata())
- Do we need to scale features (sepal or petal length to 0 to 1)?
Yes, both ranges must be identically scaled (not necessarily 0 to 1), otherwise step #2 wouldn't work.
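The x-ranges of the KDE curves must be identical for step #2, because np.minimum compares the two curves element by element and each pair of y-values must therefore belong to the same x position. For that reason we explicitly set the interval with the clip keyword of the kdeplot function.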
This is the whole program:
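# x1, x2 and the imports come from the code in the question
# use the same clip interval for both KDEs so that their x-ranges match
clip = {'clip': (-.2, 1.2)}
sns.distplot(x1, kde_kws=clip)
ax = sns.distplot(x2, kde_kws=clip)
# area under each KDE curve (each should be close to 1)
area1 = np.trapz(ax.lines[0].get_ydata(), ax.lines[0].get_xdata())
area2 = np.trapz(ax.lines[1].get_ydata(), ax.lines[1].get_xdata())
# overlap area: integrate the pointwise minimum of the two curves
ymin = np.minimum(ax.lines[0].get_ydata(), ax.lines[1].get_ydata())
area_overlap = np.trapz(ymin, ax.lines[0].get_xdata())
print(area1, area2, area_overlap)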
#0.9997488977867803 0.9999803817881264 0.8338245964155915
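As a plotting-independent variant, the same idea can be sketched with plain NumPy: evaluate both densities on one shared grid, then integrate their pointwise minimum. This is only a sketch under assumptions. The helper names (gaussian_kde_on_grid, trapezoid, overlap_area), the Scott's-rule bandwidth, the grid padding and the synthetic samples are all illustrative and not part of the original answer, so the numbers will differ from the seaborn output above.

```python
import numpy as np


def gaussian_kde_on_grid(samples, grid, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate of `samples` at `grid`."""
    samples = np.asarray(samples, dtype=float)
    if bandwidth is None:
        # Assumption: Scott's rule of thumb for 1-D data
        bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)
    # One Gaussian bump per sample, averaged at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))


def overlap_area(a, b, n_points=512, pad=3.0):
    """Return (area_a, area_b, overlap) for the KDEs of samples a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared grid covering both samples, padded so the tails are included
    spread = max(a.std(ddof=1), b.std(ddof=1))
    lo = min(a.min(), b.min()) - pad * spread
    hi = max(a.max(), b.max()) + pad * spread
    grid = np.linspace(lo, hi, n_points)
    dens_a = gaussian_kde_on_grid(a, grid)
    dens_b = gaussian_kde_on_grid(b, grid)
    # Overlap is the integral of the pointwise minimum of the two densities
    overlap = trapezoid(np.minimum(dens_a, dens_b), grid)
    return trapezoid(dens_a, grid), trapezoid(dens_b, grid), overlap


# Synthetic example data (illustrative only, not the iris features)
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 300)
b = rng.normal(0.5, 1.0, 300)
area_a, area_b, overlap = overlap_area(a, b)
print(area_a, area_b, overlap)
```

scipy.stats.gaussian_kde could be swapped in for the hand-rolled estimator. The point of the design is the single shared grid, which guarantees that np.minimum compares the two curves at the same x positions without relying on how the plotting library chooses its support.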
Upvotes: 1