Reputation:
I have a data set how I can find the average of numerical columns for each 30 rows and make a smaller data set?
For example:
data
A B C
2. 4 f
3 1 d
4 2 p
3 1 l
2 0 g
Lets I wanna find the average of each 2 rows (for simplification).
A B C
(2+3)/2 (4+1)/2 d
(4+3)/2 (2+1)/2 l
2 0 g
Upvotes: 0
Views: 37
Reputation: 35686
Try dividing the index
by n
to get groups then use groupby aggregate
:
n = 2
new_df = df.groupby(df.index // n).agg({'A': 'mean', 'B': 'mean', 'C': 'last'})
new_df
:
A B C
0 2.5 2.5 d
1 3.5 1.5 l
2 2.0 0.0 g
Or the aggregation dict
can be built programmatically by dtype
here object dtypes are aggregated by taking the last
value, all other types are aggregated with the mean
:
import numpy as np
n = 2
aggregation_d = dict(
zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'last'))
)
new_df = df.groupby(df.index // n).agg(aggregation_d)
aggregation_d
:
{'A': 'mean', 'B': 'mean', 'C': 'last'}
new_df
:
A B C
0 2.5 2.5 d
1 3.5 1.5 l
2 2.0 0.0 g
Explanation:
df.index // n
Creates a Series of values:
Int64Index([0, 0, 1, 1, 2], dtype='int64')
These values establish a group.
Then the dict
of values specifies how to aggregate each group:
{'A': 'mean', 'B': 'mean', 'C': 'last'}
Take the mean
of A
and B
, and the last
value of C
.
Upvotes: 1