Bob Fang
Bob Fang

Reputation: 7381

Pandas groupby.apply tries to preserve the original dataframe strucutre

I have the following dataframe:

In [11]: import numpy as np
    ...: import pandas as pd
    ...: df = pd.DataFrame(np.random.random(size=(10,10)), index=range(10), columns=range(10))
    ...: cols = pd.MultiIndex.from_product([['a', 'b', 'c', 'd', 'e'], ['m', 'n']], names=['l1', 'l2'])
    ...: df.columns = cols

In [12]: df
Out[12]:
l1         a                   b                   c                   d                   e
l2         m         n         m         n         m         n         m         n         m         n
0   0.257448  0.207198  0.443456  0.553674  0.765539  0.428972  0.587296  0.942761  0.115083  0.073907
1   0.099647  0.702320  0.792053  0.409488  0.112574  0.435044  0.767640  0.946108  0.257002  0.286178
2   0.110061  0.058266  0.350634  0.657057  0.900674  0.882870  0.250355  0.861289  0.041383  0.981890
3   0.408866  0.042692  0.726473  0.482945  0.030925  0.337217  0.377866  0.095778  0.033939  0.550848
4   0.255034  0.455349  0.193223  0.377962  0.445834  0.400846  0.725098  0.567926  0.052293  0.471593
5   0.133966  0.239252  0.479669  0.678660  0.146475  0.042264  0.929615  0.873308  0.603774  0.788071
6   0.068064  0.849320  0.786785  0.767797  0.534253  0.348995  0.267851  0.838200  0.351832  0.566974
7   0.240924  0.089154  0.161263  0.179304  0.077933  0.846366  0.916394  0.771528  0.798970  0.942207
8   0.808719  0.737900  0.300483  0.205682  0.073342  0.081998  0.002116  0.550923  0.460010  0.650109
9   0.413887  0.671698  0.294521  0.833841  0.002094  0.363820  0.148294  0.632994  0.278557  0.340835

And then I want to do the following groupby-apply operation.

In [17]: def func(df):
    ...:     return df.loc[:, df.columns.get_level_values('l2') == 'm']
    ...:
In [19]: df.groupby(level='l1', axis=1).apply(func)
Out[19]:
l1         a             b             c             d             e
l2         m   n         m   n         m   n         m   n         m   n
0   0.257448 NaN  0.443456 NaN  0.765539 NaN  0.587296 NaN  0.115083 NaN
1   0.099647 NaN  0.792053 NaN  0.112574 NaN  0.767640 NaN  0.257002 NaN
2   0.110061 NaN  0.350634 NaN  0.900674 NaN  0.250355 NaN  0.041383 NaN
3   0.408866 NaN  0.726473 NaN  0.030925 NaN  0.377866 NaN  0.033939 NaN
4   0.255034 NaN  0.193223 NaN  0.445834 NaN  0.725098 NaN  0.052293 NaN
5   0.133966 NaN  0.479669 NaN  0.146475 NaN  0.929615 NaN  0.603774 NaN
6   0.068064 NaN  0.786785 NaN  0.534253 NaN  0.267851 NaN  0.351832 NaN
7   0.240924 NaN  0.161263 NaN  0.077933 NaN  0.916394 NaN  0.798970 NaN
8   0.808719 NaN  0.300483 NaN  0.073342 NaN  0.002116 NaN  0.460010 NaN
9   0.413887 NaN  0.294521 NaN  0.002094 NaN  0.148294 NaN  0.278557 NaN


Notice that even if I do not retun any data for columns with l2=='n', the structure of the original dataframe is still preserved and pandas automatically fill in the values with nan.

This is a simplified example, my intention here is not to select out the 'm' columns, this example is just for a illustration of the problem I am facing -- I want to apply some function on some subset of the columns in the dataframe and the result dataframe should only have the columns I care about.

Also I noticed that you cannot rename the column in the apply function. For example if you do:

In [25]: def func(df):
    ...:     df = df.loc[:, df.columns.get_level_values('l2') == 'm']
    ...:     df = df.rename(columns={'m':'p'}, level=1)
    ...:     return df
    ...:

In [26]: df.groupby(level='l1', axis=1).apply(func)
Out[26]:
l1   a       b       c       d       e
l2   m   n   m   n   m   n   m   n   m   n
0  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Notice the result is full of NaN but the original format of the DF is preserved.

My question is, what should I do so that in the applied function I can manipulate the df so the output of the apply can be different in shape compared to the original df?

Upvotes: 0

Views: 143

Answers (1)

ralex
ralex

Reputation: 388

Read "What is the difference between pandas agg and apply function?". Depending on your actual use case, you may not need to change the function being passed into .agg or .apply.

I want to apply some function on some subset of the columns in the dataframe

You can shape the DataFrame before grouping, or return only a subset of e.g. columns with the desired aggregation or function application.

# pass an indexed view
grouped0 = df.loc[:, ['a', 'b', 'c'].groupby(level='l1', axis=1)

# perform the .agg or .apply on a subset of e.g. columns
result1 = df.groupby(level='l1', axis=1)['a', 'b', 'c'].agg(np.sum)

Using .agg on your example code:

In [2]: df
Out[2]:
l1         a                   b            ...         d                   e
l2         m         n         m         n  ...         m         n         m         n
0   0.007932  0.697320  0.181242  0.380013  ...  0.075391  0.820732  0.335901  0.808365
1   0.736584  0.621418  0.736926  0.962414  ...  0.331465  0.711948  0.426704  0.849730
2   0.099217  0.802882  0.082109  0.489288  ...  0.758056  0.627021  0.539329  0.808187
3   0.152319  0.378918  0.205193  0.489060  ...  0.337615  0.475191  0.025432  0.616413
4   0.582070  0.709464  0.739957  0.472041  ...  0.299662  0.151314  0.113506  0.504926
5   0.351747  0.480518  0.424127  0.364428  ...  0.267780  0.092946  0.134434  0.443320
6   0.572375  0.157129  0.582345  0.124572  ...  0.074523  0.421519  0.733218  0.079004
7   0.026940  0.762937  0.108213  0.073087  ...  0.758596  0.559506  0.601568  0.603528
8   0.991940  0.864772  0.759207  0.523460  ...  0.981770  0.332174  0.012079  0.034952

In [4]: df.groupby(level='l1', axis=1).sum()
Out[4]:
l1         a         b         c         d         e
0   0.705252  0.561255  0.804299  0.896123  1.144266
1   1.358002  1.699341  1.422559  1.043413  1.276435
2   0.902099  0.571397  0.273161  1.385077  1.347516
3   0.531237  0.694253  0.914989  0.812806  0.641845
4   1.291534  1.211998  1.138044  0.450976  0.618433
5   0.832265  0.788555  1.063437  0.360726  0.577754
6   0.729504  0.706917  1.018795  0.496042  0.812222
7   0.789877  0.181300  0.406009  1.318102  1.205095
8   1.856713  1.282666  1.183835  1.313944  0.047031
9   0.273369  0.391189  0.867865  0.978350  0.654145

In [10]: df.groupby(level='l1', axis=1).agg(lambda x: x[0])
Out[10]:
l1         a         b         c         d         e
0   0.007932  0.181242  0.708712  0.075391  0.335901
1   0.736584  0.736926  0.476286  0.331465  0.426704
2   0.099217  0.082109  0.037351  0.758056  0.539329
3   0.152319  0.205193  0.419761  0.337615  0.025432
4   0.582070  0.739957  0.279153  0.299662  0.113506
5   0.351747  0.424127  0.845485  0.267780  0.134434
6   0.572375  0.582345  0.309942  0.074523  0.733218
7   0.026940  0.108213  0.084424  0.758596  0.601568
8   0.991940  0.759207  0.412974  0.981770  0.012079
9   0.045315  0.282569  0.019320  0.638741  0.292028

In [11]: df.groupby(level='l1', axis=1).agg(lambda x: x[1])
Out[11]:
l1         a         b         c         d         e
0   0.697320  0.380013  0.095587  0.820732  0.808365
1   0.621418  0.962414  0.946274  0.711948  0.849730
2   0.802882  0.489288  0.235810  0.627021  0.808187
3   0.378918  0.489060  0.495227  0.475191  0.616413
4   0.709464  0.472041  0.858891  0.151314  0.504926
5   0.480518  0.364428  0.217953  0.092946  0.443320
6   0.157129  0.124572  0.708853  0.421519  0.079004
7   0.762937  0.073087  0.321585  0.559506  0.603528
8   0.864772  0.523460  0.770861  0.332174  0.034952
9   0.228054  0.108620  0.848545  0.339609  0.362117

Since you say that your example func is not your use case, please provide an example of your specific use case if the general cases don't fit.

Upvotes: 1

Related Questions