jakes
jakes

Reputation: 2085

Calculating statistics of each kdeplot contour

I've got a following DataFrame:

import pandas as pd 
ex = pd.DataFrame({'x': {51963: 60.0, 52020: 110.0, 52054: 90.0, 52071: 86.0, 52072: 86.0, 52073: 86.0, 52131: 96.0, 52132: 96.0, 52140: 115.0, 52209: 92.0, 52346: 112.0, 52347: 114.0, 52429: 103.0, 52497: 104.0, 52561: 88.0, 52626: 110.0, 52627: 110.0, 52630: 109.0, 52631: 111.0, 52685: 105.0, 52725: 95.0, 52726: 95.0, 52727: 95.0, 52915: 100.0, 52916: 100.0, 52918: 101.0, 52936: 64.0, 52940: 66.0, 52987: 96.0, 52988: 96.0, 53088: 67.0, 53122: 76.0, 53123: 76.0, 53172: 105.0, 53420: 5.0, 53561: 105.0, 53585: 114.0, 53586: 114.0, 53587: 118.0, 53681: 105.0, 53927: 115.0, 53968: 117.0, 53993: 107.0, 54062: 114.0, 54063: 113.0, 54140: 103.0, 54141: 103.0, 54143: 107.0, 54145: 107.0, 54146: 107.0, 54200: 84.0, 54624: 88.0, 54625: 88.0, 54661: 116.0, 54664: 114.0, 54679: 119.0, 54685: 67.0, 54695: 59.0, 54706: 64.0, 54711: 69.0, 54722: 70.0, 54751: 100.0, 54753: 104.0, 54934: 81.0, 54960: 67.0, 55028: 107.0, 55082: 99.0, 55083: 99.0, 55084: 99.0, 55198: 102.0, 55199: 102.0, 55200: 102.0, 55279: 55.0, 55280: 55.0, 55388: 99.0, 55391: 97.0, 55392: 96.0, 55459: 97.0, 55460: 97.0, 55464: 99.0, 55465: 99.0, 55467: 97.0, 55499: 113.0, 55500: 113.0, 55501: 114.0, 55504: 107.0, 111812: 61.0, 111862: 69.0, 111863: 69.0, 111864: 68.0, 111868: 68.0, 111872: 63.0, 111971: 82.0, 111972: 82.0, 111974: 83.0, 111995: 101.0, 111996: 101.0, 111997: 102.0, 112041: 95.0, 112042: 95.0}, 'y': {51963: 41.0, 52020: 45.0, 52054: 57.0, 52071: 12.0, 52072: 12.0, 52073: 13.0, 52131: 26.0, 52132: 26.0, 52140: 34.0, 52209: 19.0, 52346: 47.0, 52347: 45.0, 52429: 39.0, 52497: 18.0, 52561: 12.0, 52626: 54.0, 52627: 54.0, 52630: 53.0, 52631: 51.0, 52685: 35.0, 52725: 37.0, 52726: 37.0, 52727: 37.0, 52915: 58.0, 52916: 58.0, 52918: 58.0, 52936: 34.0, 52940: 41.0, 52987: 52.0, 52988: 52.0, 53088: 28.0, 53122: 52.0, 53123: 52.0, 53172: 32.0, 53420: 37.0, 53561: 13.0, 53585: 28.0, 53586: 28.0, 53587: 21.0, 53681: 26.0, 53927: 38.0, 53968: 38.0, 53993: 35.0, 54062: 32.0, 54063: 31.0, 54140: 41.0, 54141: 41.0, 54143: 33.0, 54145: 36.0, 54146: 36.0, 54200: 24.0, 54624: 14.0, 54625: 14.0, 54661: 40.0, 54664: 41.0, 54679: 39.0, 54685: 43.0, 54695: 59.0, 54706: 44.0, 54711: 28.0, 54722: 18.0, 54751: 22.0, 54753: 22.0, 54934: 57.0, 54960: 51.0, 55028: 22.0, 55082: 19.0, 55083: 19.0, 55084: 19.0, 55198: 27.0, 55199: 27.0, 55200: 27.0, 55279: 44.0, 55280: 44.0, 55388: 29.0, 55391: 30.0, 55392: 33.0, 55459: 14.0, 55460: 14.0, 55464: 9.0, 55465: 9.0, 55467: 10.0, 55499: 11.0, 55500: 11.0, 55501: 8.0, 55504: 14.0, 111812: 40.0, 111862: 24.0, 111863: 24.0, 111864: 21.0, 111868: 23.0, 111872: 5.0, 111971: 18.0, 111972: 18.0, 111974: 16.0, 111995: 14.0, 111996: 14.0, 111997: 12.0, 112041: 15.0, 112042: 15.0}})

For this DataFrame I can plot density using sns.kdeplot() as follows:

import seaborn as sns
ax = sns.kdeplot(df.x, df.y, cmap = 'Reds')
ax.set_xlabel('')
ax.set_ylabel('')

enter image description here

From what I observed, on default, there are 10 contours on sns.kdeplot(), meaning that data is binned into 11 bins, with contours dividing bins with different densities. Let's say I want to take the 3rd contour, counting from the most outside one, and calculate various statistics for it - e.g. area enclosed by this contour, horizontal or vertical range etc.. How can I do that? In other words, how can I calculate 2-dimensional kde, and then calculate the area for which kde is greater than, for example, 0.2?

Upvotes: 0

Views: 2548

Answers (1)

Sam Mason
Sam Mason

Reputation: 16174

if you don't mind using a finite approximation to this (this is probably better than getting contour data from matplotlib) you could do something like:

import numpy as np
import scipy.stats as sps

# estimate kernel density of data
kde = sps.gaussian_kde(ex.values.T)

# get a regular grid of points over our region of interest
xx, yy = np.meshgrid(
    np.linspace(0, 130, 500),
    np.linspace(0, 70, 500))

# calculate probability density on these points
z = kde.pdf([xx.ravel(), yy.ravel()]).reshape(xx.shape)

# note, the above calls are identical to how seaborn does things

# proportion of points above the 30%, i.e. approx the third contour line
zi = z > np.max(z) * 0.3

# print some summaries
print('x = (%.1f, %.1f)' % (min(xx[zi]), max(xx[zi])))
print('y = (%.1f, %.1f)' % (min(yy[zi]), max(yy[zi])))
print('area = %.1f' % (130 * 70 * np.mean(zi)))

which gives me:

x = (57.1, 124.3)
y = (3.4, 59.2)
area = 2154.8

you can see what seaborn currently does by searching for "bivariate" in distributions.py (e.g. functions _bivariate_kdeplot and _scipy_bivariate_kde). the reason I say this is "better" than getting contour data from matplotlib is because I'm basically doing the same thing, just with a higher resolution grid (the above samples it 25 times as much).

the summaries should mostly be obvious, except maybe the "area" one which is like a monte-carlo estimate (that page animates a demo calculating pi) except we're sampling on a regular grid so the error will be much smaller.

the estimates above will be approximate, x is accurate to nearest 130 / 500 = 0.26, and y is 0.14. the area is accurate to almost 4 significant figures; I get a standard deviation of 0.30 and 95% CI of (2154.3, 2155.4). if you figure out how to calculate this using contours from matplotlib it would be great if you posted it so I could compare and see which is actually "better".

Upvotes: 4

Related Questions