Reputation: 43
My dataframe looks like this
time Values
0 1 [11, 10]
1 2 [10, 17]
2 1 [13, 14]
3 2 [14, 15]
My goal is to average the rows that have the same time value, so it looks like this.
time Values
0 1 [12, 12]
1 2 [12, 16]
I tried to use the below code but that didn't work because the values are arrays. However, I would like to keep them as arrays.
df = df.groupby('time', as_index=False).mean()
pandas.core.base.DataError: No numeric types to aggregate
I was able to solve the problem using the below set of code, but I was wondering is there anything easier?
# Separate array into their own columns
df2 = pd.DataFrame(df['Values'].to_list()) # new array of just separate columns
df = pd.concat([df['time'], df2], axis=1) # add separate columns to time column
# Average values of duplicate entries
df = df.groupby('time', as_index=False).mean()
# Go back to old format
df_timeless = df[df.columns[1:]] # data_frame without time to make combination step easier
df['Values'] = df_timeless.values.tolist() # combine all values into an array
df = df.drop(df.iloc[:, 1:-1], axis=1) # get rid of separated values, so it's only the array
Upvotes: 1
Views: 839
Reputation: 59549
pandas
is meant for simple scalar data -- having a column of lists is going to overly complicate your life and severely impact performance.
A better organization would be multiple columns with one that corresponds to the index of the list position, that way you can store the scalar values. It takes some work to get to that organization, but once you're there it's a simple groupby
to do your calculation.
df1 = df.explode('Values')
df1['idx'] = df1.groupby(level=0).cumcount()
df1['Values'] = pd.to_numeric(df1['Values'])
# time Values idx
#0 1 11 0
#0 1 10 1
#1 2 10 0
#1 2 17 1
#2 1 13 0
#2 1 14 1
#3 2 14 0
#3 2 15 1
df1.groupby(['time', 'idx']).mean()
# Values
#time idx
#1 0 12
# 1 12
#2 0 12
# 1 16
If you really needed the lists you could agg back to those, but wouldn't recommend
df1.groupby(['time', 'idx']).mean().groupby(level='time').agg(list)
# Values
#time
#1 [12, 12]
#2 [12, 16]
Upvotes: 3
Reputation: 2442
You can use groupby
with lambda
function and apply mean
method to average out rows-
from statistics import mean
df = df.groupby(['time'])['Values'].apply(lambda x:list(map(mean, zip(*x)))).reset_index()
print(df)
Output-
time Values
0 1 [12, 12]
1 2 [12, 16]
Upvotes: 3