operations in pandas DataFrame

Question

I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:

>>> data.head()
   Hs  Tp   wd  seed  max  min
0   1   9  165    22  225   18
1   1   9  195    16  190   18
2   2   5  165    43  193   12
3   2  10  180    15  141   22
4   1   6  180    17  219   18
>>> len(data)
4500

I want to keep only the first 2 parameters and get the maximum standard deviation for all 'seed's calculated individually for each 'wd'.

In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:

>>> stdev.head()
  Hs Tp       max       min
0  1  5  43.31321  4.597629
1  1  6  43.20004  4.640795
2  1  7  47.31507  4.569408
3  1  8  41.75081  4.651762
4  1  9  41.35818  4.285991
>>> len(stdev)
30

The following code does what I want, but since I have little understanding about DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramy way =)

import pandas as pd
import numpy as np

#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1,4,size=total)
data['Tp'] = np.random.randint(5,15,size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0,3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1,51,size=total)
data['max'] = np.random.randint(100,250,size=total)
data['min'] = np.random.randint(10,25,size=total)

# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns = ['Hs', 'Tp', 'max', 'min'])
i=0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
               hs, 
               tp, 
               max([np.std(data_tp[data_tp['wd']==wd]['max']) for wd in set(data_tp['wd'])]), 
               max([np.std(data_tp[data_tp['wd']==wd]['min']) for wd in set(data_tp['wd'])])]
        i+=1

Thanks!

PS: if curious, this is statistics on variables depending on sea waves. Hs is wave height, Tp wave period, wd wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks or my variable during a certain exposition time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.

mdurant · Accepted Answer

This could be a one-liner, if I understood you correctly:

data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])

(include reset_index() on the end if you want)

operations in pandas DataFrame

Answers (1)

Related Questions