Reputation: 838
So I have a dataframe with a list of particle trajectories with lat/lon pairs, the mass of the particle, and a cell bin to which the particle is inside at that particular time.
lon lat mass cell_bins
time trajectory
2010-06-03 1 -96.467 24.835 0.461 8448b65ffffffff
2 -97.574 24.006 0.460 8445693ffffffff
3 -96.473 24.701 0.454 8448b65ffffffff
4 -96.980 23.322 0.436 844569dffffffff
5 -97.258 24.245 0.452 8448b69ffffffff
... ... ... ...
2010-06-04 1 -96.467 24.835 0.461 8448b65ffffffff
2 -97.574 24.006 0.460 8445693ffffffff
3 -96.473 24.701 0.454 8448b65ffffffff
4 -96.980 23.322 0.436 844569dffffffff
5 -97.258 24.245 0.452 8448b69ffffffff
....
...
What I need to do is create a new dataframe with the particles grouped by the cell_bin with the sum of the particles inside the bin, and the sum of the mass, but at each time step.
This is the output that I need:
particle_sum mass_sum
cell_bins time
844575dffffffff 2010-06-03T00:00:00.000000000 2 2.0
2010-06-04T00:00:00.000000000 9 6.9708424
2010-06-05T00:00:00.000000000 15 10.779473
2010-06-06T00:00:00.000000000 12 8.653734
2010-06-07T00:00:00.000000000 8 5.150217
... ... ... ...
84454b9ffffffff 2010-07-29T00:00:00.000000000 0 0.0
2010-07-30T00:00:00.000000000 0 0.0
2010-07-31T00:00:00.000000000 0 0.0
2010-08-01T00:00:00.000000000 0 0.0
2010-08-02T00:00:00.000000000 1 0.34637067
What I'm currently doing is a nested loop, by all the unique cell_bin's and all the timesteps in the source dataframe, which is kind of embarrassing because it is definitely a more efficient and pandas way to do it.
rows = []
for cellbin in df['cell_bins'].unique():
for timestep in df.index.get_level_values(0).unique().values:
_fdf = df.loc[(timestep)]
_f2df = _fdf[ _fdf['cell_bins']==cellbin ]
rows.append( [cellbin, timestep, _f2df['mass'].size, _f2df['mass'].sum() ] )
binsdf = pd.DataFrame(np.array(rows) , columns=['cell_bins', 'time', 'particle_sum', 'mass_sum'] )
binsdf.set_index(['cell_bins','time'], inplace=True)
I could find a way to group and make the combinations properly, can anyone provide a better strategy? Thxs
EDIT: 5-May-2020
On @jezrael request I'm including a minimal example.
In this example I'm omitting the lat/lon pairs, they are not needed for the sake of the example:
import pandas as pd
idx = pd.MultiIndex.from_product([['2010-06-03','2010-06-04','2010-06-05'],
[1, 2, 3]],
names=['time', 'trajectory'])
cols = ['mass','cell_bins']
df = pd.DataFrame([[0.2,'A'],[0.4,'A'],[0.3,'A'],
[0.1,'A'],[0.2,'B'],[0.3,'B'],
[0.1,'B'],[0.2,'C'],[0.3,'C']], idx, cols)
The dataframe:
mass cell_bins
time trajectory
2010-06-03 1 0.2 A
2 0.4 A
3 0.3 A
2010-06-04 1 0.1 A
2 0.2 B
3 0.3 B
2010-06-05 1 0.1 B
2 0.2 C
3 0.3 C
The output that I need to create is:
particle_sum mass_sum
cell_bins time
A 2010-06-03 3 0.9
2010-06-04 1 0.1
2010-06-05 0 0.0
B 2010-06-03 0 0.0
2010-06-04 2 0.5
2010-06-05 1 0.1
C 2010-06-03 0 0.0
2010-06-04 0 0.0
2010-06-05 2 0.5
For each cellbin in each timestep i need to know the number or particles, and the sum of the mass.
Upvotes: 1
Views: 239
Reputation: 863291
I think you need aggregate by GroupBy.agg
with counts by GroupBy.size
and GroupBy.sum
in named aggregations and then add missing combinations by DataFrame.unstack
and DataFrame.stack
:
df = (df.groupby(['cell_bins', 'time'])
.agg(particle_sum = ('mass','size'),
mass_sum= ('mass','sum'))
.unstack(fill_value=0)
.stack())
print (df)
particle_sum mass_sum
cell_bins time
A 2010-06-03 3 0.9
2010-06-04 1 0.1
2010-06-05 0 0.0
B 2010-06-03 0 0.0
2010-06-04 2 0.5
2010-06-05 1 0.1
C 2010-06-03 0 0.0
2010-06-04 0 0.0
2010-06-05 2 0.5
Upvotes: 2