Favo
Favo

Reputation: 838

Combinations of a dataframe column and index

So I have a dataframe with a list of particle trajectories with lat/lon pairs, the mass of the particle, and a cell bin to which the particle is inside at that particular time.

                         lon     lat      mass        cell_bins
time       trajectory                                           
2010-06-03 1           -96.467  24.835     0.461  8448b65ffffffff
           2           -97.574  24.006     0.460  8445693ffffffff
           3           -96.473  24.701     0.454  8448b65ffffffff
           4           -96.980  23.322     0.436  844569dffffffff
           5           -97.258  24.245     0.452  8448b69ffffffff
... ... ... ...
2010-06-04 1           -96.467  24.835     0.461  8448b65ffffffff
           2           -97.574  24.006     0.460  8445693ffffffff
           3           -96.473  24.701     0.454  8448b65ffffffff
           4           -96.980  23.322     0.436  844569dffffffff
           5           -97.258  24.245     0.452  8448b69ffffffff
....
...

What I need to do is create a new dataframe with the particles grouped by the cell_bin with the sum of the particles inside the bin, and the sum of the mass, but at each time step.

This is the output that I need:

                                                particle_sum    mass_sum
cell_bins       time        
844575dffffffff 2010-06-03T00:00:00.000000000   2   2.0
                2010-06-04T00:00:00.000000000   9   6.9708424
                2010-06-05T00:00:00.000000000   15  10.779473
                2010-06-06T00:00:00.000000000   12  8.653734
                2010-06-07T00:00:00.000000000   8   5.150217
... ... ... ...
84454b9ffffffff 2010-07-29T00:00:00.000000000   0   0.0
                2010-07-30T00:00:00.000000000   0   0.0
                2010-07-31T00:00:00.000000000   0   0.0
                2010-08-01T00:00:00.000000000   0   0.0
                2010-08-02T00:00:00.000000000   1   0.34637067

What I'm currently doing is a nested loop, by all the unique cell_bin's and all the timesteps in the source dataframe, which is kind of embarrassing because it is definitely a more efficient and pandas way to do it.

rows = []
for cellbin in df['cell_bins'].unique():
    for timestep in df.index.get_level_values(0).unique().values:    

        _fdf = df.loc[(timestep)]
        _f2df = _fdf[ _fdf['cell_bins']==cellbin ]
        rows.append( [cellbin, timestep, _f2df['mass'].size, _f2df['mass'].sum() ] )         

binsdf = pd.DataFrame(np.array(rows) , columns=['cell_bins', 'time', 'particle_sum', 'mass_sum'] )
binsdf.set_index(['cell_bins','time'], inplace=True)

I could find a way to group and make the combinations properly, can anyone provide a better strategy? Thxs

EDIT: 5-May-2020

On @jezrael request I'm including a minimal example.

In this example I'm omitting the lat/lon pairs, they are not needed for the sake of the example:

import pandas as pd

idx = pd.MultiIndex.from_product([['2010-06-03','2010-06-04','2010-06-05'],
                                  [1, 2, 3]],
                                 names=['time', 'trajectory'])
cols = ['mass','cell_bins']
df = pd.DataFrame([[0.2,'A'],[0.4,'A'],[0.3,'A'],
                   [0.1,'A'],[0.2,'B'],[0.3,'B'],
                   [0.1,'B'],[0.2,'C'],[0.3,'C']], idx, cols)

The dataframe:

                        mass    cell_bins
time        trajectory      
2010-06-03   1           0.2    A
             2           0.4    A
             3           0.3    A
2010-06-04   1           0.1    A
             2           0.2    B
             3           0.3    B
2010-06-05   1           0.1    B
             2           0.2    C
             3           0.3    C

The output that I need to create is:

                    particle_sum    mass_sum
cell_bins  time     
A         2010-06-03    3     0.9
          2010-06-04    1     0.1
          2010-06-05    0     0.0
B         2010-06-03    0     0.0
          2010-06-04    2     0.5
          2010-06-05    1     0.1
C         2010-06-03    0     0.0
          2010-06-04    0     0.0
          2010-06-05    2     0.5

For each cellbin in each timestep i need to know the number or particles, and the sum of the mass.

Upvotes: 1

Views: 239

Answers (1)

jezrael
jezrael

Reputation: 863291

I think you need aggregate by GroupBy.agg with counts by GroupBy.size and GroupBy.sum in named aggregations and then add missing combinations by DataFrame.unstack and DataFrame.stack:

df = (df.groupby(['cell_bins', 'time'])
       .agg(particle_sum = ('mass','size'),
            mass_sum= ('mass','sum'))
       .unstack(fill_value=0)
       .stack())
print (df)
                      particle_sum  mass_sum
cell_bins time                              
A         2010-06-03             3       0.9
          2010-06-04             1       0.1
          2010-06-05             0       0.0
B         2010-06-03             0       0.0
          2010-06-04             2       0.5
          2010-06-05             1       0.1
C         2010-06-03             0       0.0
          2010-06-04             0       0.0
          2010-06-05             2       0.5

Upvotes: 2

Related Questions