Jack Cahill
Jack Cahill

Reputation: 43

Pandas: calculating mean values of duplicate entries in a dataframe, but values are arrays

My dataframe looks like this

  time    Values
0    1  [11, 10]
1    2  [10, 17]
2    1  [13, 14]
3    2  [14, 15]

My goal is to average the rows that have the same time value, so it looks like this.

  time    Values
0    1  [12, 12]
1    2  [12, 16]

I tried to use the below code but that didn't work because the values are arrays. However, I would like to keep them as arrays.

df = df.groupby('time', as_index=False).mean()
pandas.core.base.DataError: No numeric types to aggregate

I was able to solve the problem using the below set of code, but I was wondering is there anything easier?

# Separate array into their own columns
df2 = pd.DataFrame(df['Values'].to_list())  # new array of just separate columns
df = pd.concat([df['time'], df2], axis=1)  # add separate columns to time column

# Average values of duplicate entries
df = df.groupby('time', as_index=False).mean()

# Go back to old format
df_timeless = df[df.columns[1:]]  # data_frame without time to make combination step easier
df['Values'] = df_timeless.values.tolist()  # combine all values into an array
df = df.drop(df.iloc[:, 1:-1], axis=1)  # get rid of separated values, so it's only the array

Upvotes: 1

Views: 839

Answers (2)

ALollz
ALollz

Reputation: 59549

pandas is meant for simple scalar data -- having a column of lists is going to overly complicate your life and severely impact performance.

A better organization would be multiple columns with one that corresponds to the index of the list position, that way you can store the scalar values. It takes some work to get to that organization, but once you're there it's a simple groupby to do your calculation.

df1 = df.explode('Values')
df1['idx'] = df1.groupby(level=0).cumcount()
df1['Values'] = pd.to_numeric(df1['Values'])
#   time  Values  idx
#0     1      11    0
#0     1      10    1
#1     2      10    0
#1     2      17    1
#2     1      13    0
#2     1      14    1
#3     2      14    0
#3     2      15    1

df1.groupby(['time', 'idx']).mean()
#          Values
#time idx        
#1    0        12
#     1        12
#2    0        12
#     1        16

If you really needed the lists you could agg back to those, but wouldn't recommend

df1.groupby(['time', 'idx']).mean().groupby(level='time').agg(list)
#        Values
#time          
#1     [12, 12]
#2     [12, 16]

Upvotes: 3

Shradha
Shradha

Reputation: 2442

You can use groupby with lambda function and apply mean method to average out rows-

from statistics import mean
df = df.groupby(['time'])['Values'].apply(lambda x:list(map(mean, zip(*x)))).reset_index()
print(df)

Output-

  time    Values
0    1  [12, 12]
1    2  [12, 16]

Upvotes: 3

Related Questions