Reputation: 85
I have a (7.5MM, 17) dataframe with 500k unique items that I am "de-pivoting", for lack of a better word: each unique item can have any number of rows associated with it across various traits, which I am aggregating into columns so that each unique item ends up with exactly one row.
To do this I have a for loop iterating over each unique item in the dataframe. It slices the dataframe into a temporary dataframe containing only the rows for that item. I then apply
df.loc[df['trait']=='xyz'].sum()
for 8 traits, each generating an aggregated variable. These variables are then appended to a temporary list (loop-internal), which is in turn appended to the output list (loop-external).
At the end of each iteration, all rows related to that unique item are dropped from the base dataframe, so that as the output list grows the base dataframe shrinks (by a factor of ~15 overall).
I expected this to run extremely quickly, since to the best of my knowledge the manipulation inside the for loop is vectorized. However, after an hour it has only gotten through ~7,000 of the unique items, which puts the projected runtime at just shy of 3 days. I can live with that, but am hoping for some insight into a more efficient solution.
The machine I am running this on has 32GB RAM and 0.5TB storage. The program is only using ~5GB of RAM. Is there a way to use more RAM but move faster?
Any insight will be much appreciated.
edit -
wolst = []
cnt = 0
for i in wodf['WON'].unique().tolist():
    tlst = []
    wo = i
    # slice down to the rows for this 'WON' value
    tdf = wodf.loc[wodf['WON'] == i]
    # sum 'LIC' and 'Q' for each 'LIT' category
    matsum = tdf.loc[tdf['LIT'] == "M", 'LIC'].sum()
    labsum = tdf.loc[tdf['LIT'] == "L", 'LIC'].sum()
    labhrs = tdf.loc[tdf['LIT'] == "L", 'Q'].sum()
    tcsum = tdf.loc[tdf['LIT'] == "T", 'LIC'].sum()
    numtrp = tdf.loc[tdf['LIT'] == "T", 'Q'].sum()
    pmusum = tdf.loc[tdf['LIT'] == "PM", 'LIC'].sum()
    prtsum = tdf.loc[tdf['LIT'] == "P", 'LIC'].sum()
    stdsum = tdf.loc[tdf['LIT'] == "S", 'LIC'].sum()
    stdhrs = tdf.loc[tdf['LIT'] == "S", 'Q'].sum()
    # fold the "S" rows into the labour totals
    labsum = labsum + stdsum
    labhrs = labhrs + stdhrs
    # defensive defaults in case an aggregate comes back empty
    if labsum is None:
        labsum = 0
    if labhrs is None:
        labhrs = 0
    if matsum is None:
        matsum = 0
    if tcsum is None:
        tcsum = 0
    if numtrp is None:
        numtrp = 0
    if pmusum is None:
        pmusum = 0
    if prtsum is None:
        prtsum = 0
    tlst.append([wo, labsum, labhrs, matsum, tcsum, numtrp, pmusum, prtsum])
    wolst.append(tlst)
    print(cnt)
    cnt += 1
Where wodf is (7.5MM, 17).
Upvotes: 0
Views: 52
Reputation: 48692
You are essentially grouping by 'WON' & 'LIT', and summing up the 'LIC' & 'Q'?
Here's an example of how groupby can do something similar.
In [42]: df = pd.DataFrame({'WON' : ['foo', 'bar', 'foo', 'bar',
    ...:                             'foo', 'bar', 'foo', 'foo'],
    ...:                    'LIT' : ['one', 'one', 'two', 'three',
    ...:                             'two', 'two', 'one', 'three'],
    ...:                    'LIC' : np.random.randn(8),
    ...:                    'Q' : np.random.randn(8)})
In [43]: df
Out[43]:
WON LIT LIC Q
0 foo one 0.148776 1.963984
1 bar one 0.008530 -0.494986
2 foo two 0.218419 0.384919
3 bar three 0.944845 -0.185242
4 foo two 0.218473 1.505220
5 bar two 0.669328 0.146424
6 foo one -0.861758 0.482464
7 foo three -0.627680 1.604041
In [44]: df.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum().unstack()
Out[44]:
LIC Q
LIT one three two one three two
WON
bar 0.008530 0.944845 0.669328 -0.494986 -0.185242 0.146424
foo -0.712981 -0.627680 0.436891 2.446449 1.604041 1.890139
In [45]: df.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum()
Out[45]:
LIC Q
WON LIT
bar one 0.008530 -0.494986
three 0.944845 -0.185242
two 0.669328 0.146424
foo one -0.712981 2.446449
three -0.627680 1.604041
two 0.436891 1.890139
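To map that back to the columns in the question, here is a rough sketch (assuming wodf is the frame from the question, that every LIT code appears at least once in the data, and folding the "S" rows into the labour totals as the posted loop does; the output column names are just placeholders):

import pandas as pd

# One groupby over the whole frame instead of 500k per-item slices:
# sum 'LIC' and 'Q' for every (WON, LIT) pair, then pivot so each WON
# becomes a single row with one column per (value, LIT) combination.
agg = (wodf.groupby(['WON', 'LIT'])[['LIC', 'Q']]
           .sum()
           .unstack(fill_value=0))

# Columns are now a (value, LIT) MultiIndex; rebuild the aggregates
# from the posted loop ("S" folded into the labour totals).
out = pd.DataFrame({
    'labsum': agg[('LIC', 'L')] + agg[('LIC', 'S')],
    'labhrs': agg[('Q', 'L')] + agg[('Q', 'S')],
    'matsum': agg[('LIC', 'M')],
    'tcsum':  agg[('LIC', 'T')],
    'numtrp': agg[('Q', 'T')],
    'pmusum': agg[('LIC', 'PM')],
    'prtsum': agg[('LIC', 'P')],
}).reset_index()

This replaces the per-item Python loop with a single vectorized pass over the frame, which is where the speed-up comes from.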
If RAM usage is an issue, you could look at the dask project, which can handle out-of-core (larger-than-memory) calculations using a pandas-like API.
I would also point you to the pandas groupby documentation for more detail on this split-apply-combine pattern.
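A minimal sketch of the dask route mentioned above, again assuming wodf is the frame from the question (the partition count is an arbitrary placeholder):

import dask.dataframe as dd

# Split the existing in-memory frame into partitions that dask can
# process in parallel (or out of core for larger-than-RAM data).
ddf = dd.from_pandas(wodf, npartitions=16)

# Same groupby as above; .compute() runs the lazy task graph and
# returns an ordinary pandas DataFrame.
result = ddf.groupby(['WON', 'LIT'])[['LIC', 'Q']].sum().compute()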
Upvotes: 1