Reputation: 85
I have a (7.5MM, 17) dataframe with 500k unique items that I am "de-pivoting", for lack of a better word: each unique item can have any number of rows associated with it across various traits, which I am aggregating into columns so that each unique item ends up with exactly one row.
To do this I have a for loop iterating over each unique item in the dataframe. It slices the dataframe into a temporary dataframe containing only the rows for that item. I then apply
df.loc[df['trait']=='xyz'].sum()
for 8 traits, each generating an aggregated variable. These variables are then appended to a temporary list (loop-internal), which is in turn appended to the output list (loop-external).
At the end of each iteration, all rows related to that unique item are dropped from the base dataframe, so that as the output list grows the base dataframe shrinks (by a factor of ~15 overall).
I expected this to run extremely quickly, since to the best of my knowledge the manipulation inside the for loop is vectorized. However, after an hour it has only gotten through ~7,000 of the unique items, which puts the projected runtime at just shy of 3 days. I can live with that, but am hoping for some insight into a more efficient solution.
The machine I am running this on has 32GB RAM and 0.5TB storage. The program is only using ~5GB of RAM. Is there a way to use more RAM but move faster?
Any insight will be much appreciated.
edit -
wolst = []
cnt = 0
for i in wodf['WON'].unique().tolist():
    tlst = []
    wo = i
    # slice down to the rows for this 'WON' value
    tdf = wodf.loc[wodf['WON'] == i]
    # sum 'LIC' and 'Q' for each 'LIT' category
    matsum = tdf.loc[tdf['LIT'] == "M", 'LIC'].sum()
    labsum = tdf.loc[tdf['LIT'] == "L", 'LIC'].sum()
    labhrs = tdf.loc[tdf['LIT'] == "L", 'Q'].sum()
    tcsum = tdf.loc[tdf['LIT'] == "T", 'LIC'].sum()
    numtrp = tdf.loc[tdf['LIT'] == "T", 'Q'].sum()
    pmusum = tdf.loc[tdf['LIT'] == "PM", 'LIC'].sum()
    prtsum = tdf.loc[tdf['LIT'] == "P", 'LIC'].sum()
    stdsum = tdf.loc[tdf['LIT'] == "S", 'LIC'].sum()
    stdhrs = tdf.loc[tdf['LIT'] == "S", 'Q'].sum()
    # fold the "S" rows into the labour totals
    labsum = labsum + stdsum
    labhrs = labhrs + stdhrs
    # defensive defaults in case an aggregate comes back empty
    if labsum is None:
        labsum = 0
    if labhrs is None:
        labhrs = 0
    if matsum is None:
        matsum = 0
    if tcsum is None:
        tcsum = 0
    if numtrp is None:
        numtrp = 0
    if pmusum is None:
        pmusum = 0
    if prtsum is None:
        prtsum = 0
    tlst.append([wo, labsum, labhrs, matsum, tcsum, numtrp, pmusum, prtsum])
    wolst.append(tlst)
    print(cnt)
    cnt += 1
Where wodf is (7.5MM, 17).
Upvotes: 0
Views: 52
Reputation: 48692
You are essentially grouping by 'WON' & 'LIT', and summing up the 'LIC' & 'Q'?
Here's an example of how groupby can do something similar.
In [42]: df = pd.DataFrame({'WON' : ['foo', 'bar', 'foo', 'bar',
    ...:                             'foo', 'bar', 'foo', 'foo'],
    ...:                    'LIT' : ['one', 'one', 'two', 'three',
    ...:                             'two', 'two', 'one', 'three'],
    ...:                    'LIC' : np.random.randn(8),
    ...:                    'Q' : np.random.randn(8)})
In [43]: df
Out[43]:
WON LIT LIC Q
0 foo one 0.148776 1.963984
1 bar one 0.008530 -0.494986
2 foo two 0.218419 0.384919
3 bar three 0.944845 -0.185242
4 foo two 0.218473 1.505220
5 bar two 0.669328 0.146424
6 foo one -0.861758 0.482464
7 foo three -0.627680 1.604041
In [44]: df.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum().unstack()
Out[44]:
LIC Q
LIT one three two one three two
WON
bar 0.008530 0.944845 0.669328 -0.494986 -0.185242 0.146424
foo -0.712981 -0.627680 0.436891 2.446449 1.604041 1.890139
In [45]: df.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum()
Out[45]:
LIC Q
WON LIT
bar one 0.008530 -0.494986
three 0.944845 -0.185242
two 0.669328 0.146424
foo one -0.712981 2.446449
three -0.627680 1.604041
two 0.436891 1.890139
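To map that back to the columns in the question, here is a rough sketch (assuming wodf is the frame from the question, that every LIT code appears at least once in the data, and folding the "S" rows into the labour totals as the posted loop does; the output column names are just placeholders):

import pandas as pd

# One groupby over the whole frame instead of 500k per-item slices:
# sum 'LIC' and 'Q' for every (WON, LIT) pair, then pivot so each WON
# becomes a single row with one column per (value, LIT) combination.
agg = (wodf.groupby(['WON', 'LIT'])[['LIC', 'Q']]
           .sum()
           .unstack(fill_value=0))

# Columns are now a (value, LIT) MultiIndex; rebuild the aggregates
# from the posted loop ("S" folded into the labour totals).
out = pd.DataFrame({
    'labsum': agg[('LIC', 'L')] + agg[('LIC', 'S')],
    'labhrs': agg[('Q', 'L')] + agg[('Q', 'S')],
    'matsum': agg[('LIC', 'M')],
    'tcsum':  agg[('LIC', 'T')],
    'numtrp': agg[('Q', 'T')],
    'pmusum': agg[('LIC', 'PM')],
    'prtsum': agg[('LIC', 'P')],
}).reset_index()

This replaces the per-item Python loop with a single vectorized pass over the frame, which is where the speed-up comes from.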
If RAM usage is an issue, you could look at the dask project, which can handle out-of-core (larger-than-memory) calculations using a pandas-like API.
I would also point you to the pandas groupby documentation for more detail on this split-apply-combine pattern.
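A minimal sketch of the dask route mentioned above, again assuming wodf is the frame from the question (the partition count is an arbitrary placeholder):

import dask.dataframe as dd

# Split the existing in-memory frame into partitions that dask can
# process in parallel (or out of core for larger-than-RAM data).
ddf = dd.from_pandas(wodf, npartitions=16)

# Same groupby as above; .compute() runs the lazy task graph and
# returns an ordinary pandas DataFrame.
result = ddf.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum().compute()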
Upvotes: 1