cf1

Reputation: 77

How to calculate statistical metrics directly on a PDF in a Pandas DataFrame?

Say I already have a PDF (probability density function) stored in a Pandas DataFrame, with the values in the index and the counts in a column.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([1,2,3,4,5,6,5,4,3,2], index=np.linspace(21,30,10), columns=['days'])
df.index.names=['temperature']
print(df)
             days
temperature      
21.0            1
22.0            2
23.0            3
24.0            4
25.0            5
26.0            6
27.0            5
28.0            4
29.0            3
30.0            2

If I want to calculate metrics like skewness, I have to convert the PDF back to raw data like this:

temp_history = []
for temp, row in df.iterrows():
    # Repeat each temperature once per recorded day
    temp_history += int(row['days']) * [temp]

print(temp_history)
[21.0, 22.0, 22.0, 23.0, 23.0, 23.0, 24.0, 24.0, 24.0, 24.0, 25.0, 25.0, 25.0, 25.0, 25.0, 26.0, 26.0, 26.0, 26.0, 26.0, 26.0, 27.0, 27.0, 27.0, 27.0, 27.0, 28.0, 28.0, 28.0, 28.0, 29.0, 29.0, 29.0, 30.0, 30.0]

skew = stats.skew(temp_history)

Is there any way I can calculate the metrics without having to create temp_history? Thanks!

Edit: The reason I want to avoid creating raw data in any form is that I don't want to use up a huge chunk of memory when the numbers in the days column get bigger.

Upvotes: 2

Views: 218

Answers (1)

Vivek Kalyanarangan

Reputation: 9081

Use -

df.reindex(df.index.repeat(df['days'])).reset_index()['temperature'].skew()

OR

To stick to your original implementation -

stats.skew(df.reindex(df.index.repeat(df['days'])).reset_index()['temperature'])

And if you are wondering why the outputs won't match, it's discussed here

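To make both match, set bias=False in stats.skew(). Series.skew() in pandas returns the bias-corrected estimate, while stats.skew() defaults to bias=True.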
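
Note that repeating the index still materializes one row per day, so memory grows with the counts. If that is a concern, the skewness can probably be computed from count-weighted moments instead. Below is a minimal sketch of that idea. It is not part of the approach above, and weighted_skew is a hypothetical helper name, not a pandas or SciPy function. It assumes the days column holds non-negative counts and the index holds the values.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 5, 4, 3, 2],
                  index=np.linspace(21, 30, 10), columns=['days'])
df.index.names = ['temperature']

def weighted_skew(values, counts, bias=True):
    """Skewness of data given as (value, count) pairs.

    Hypothetical helper: treats each value as if it appeared
    `counts` times, without building the expanded array.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                                   # total number of observations
    mean = (counts * values).sum() / n                 # weighted mean
    m2 = (counts * (values - mean) ** 2).sum() / n     # second central moment
    m3 = (counts * (values - mean) ** 3).sum() / n     # third central moment
    g1 = m3 / m2 ** 1.5                                # biased (population) skewness
    if bias:
        return g1
    # Sample-size correction (adjusted Fisher-Pearson coefficient),
    # the same adjustment stats.skew applies with bias=False
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)

skew_biased = weighted_skew(df.index, df['days'])
skew_unbiased = weighted_skew(df.index, df['days'], bias=False)
```

Memory use here depends only on the number of distinct temperatures, not on how large the counts are. The same pattern should extend to other moment-based metrics such as variance or kurtosis, though each has its own bias correction.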

Upvotes: 2
