laszlopanaflex
laszlopanaflex

Reputation: 1916

groupby / aggregate dataframe with non-numeric types

I have the following situation

date_range = pd.date_range('20180101', '20180105')
date_list = list(itertools.chain.from_iterable(itertools.repeat(date, 2) for date in date_range))
num_list = np.random.randint(1,100,size=(10))
date2 = ['2018-12-31']*10

df = pd.DataFrame({'date1':date_list,'numbers':num_list,'date2':date2})

displaying this dataframe gives

      date1        date2    numbers
0   2018-01-01  2018-12-31  38
1   2018-01-01  2018-12-31  2
2   2018-01-02  2018-12-31  8
3   2018-01-02  2018-12-31  51
4   2018-01-03  2018-12-31  16
5   2018-01-03  2018-12-31  22
6   2018-01-04  2018-12-31  43
7   2018-01-04  2018-12-31  76
8   2018-01-05  2018-12-31  47
9   2018-01-05  2018-12-31  50

i would like to obtain a new dataframe which is a) grouped by date1, b) sums up the values for each date1 in the numbers column, and c) keeps the date2 value (which we can assume would be the same for each date1 or, in this case, the same for the entire dataframe

i can do the following to achieve a+b, but if i try to include something like 'date2':'mean' in the aggregation dictionary it will not work and return DataError: No numeric types to aggregate

df.groupby(['date1'],as_index=False).agg({'numbers':'sum'})

any advice?

Upvotes: 4

Views: 8001

Answers (2)

jezrael
jezrael

Reputation: 862611

It seems you need if date2 is same for each group:

df.groupby(['date1', 'date2'],as_index=False).agg({'numbers':'sum'})

Or need aggregate by first:

df.groupby(['date1'],as_index=False).agg({'numbers':'sum','date2':'first'})

But if need mean of datetime it is a bit complicated:

df['date2'] = pd.to_datetime(df['date2'])
f = lambda x: pd.to_datetime(x.values.astype(np.int64).mean())
df1 = df.groupby(['date1'],as_index=False).agg({'numbers':'sum','date2':f})
print (df1)
       date1  numbers      date2
0 2018-01-01      159 2018-12-31
1 2018-01-02      104 2018-12-31
2 2018-01-03       75 2018-12-31
3 2018-01-04       98 2018-12-31
4 2018-01-05      184 2018-12-31

Upvotes: 3

André C. Andersen
André C. Andersen

Reputation: 9375

Because of c):

keeps the date2 value (which we can assume would be the same for each date1 or, in this case, the same for the entire dataframe

... you can add it to the group by clause:

df.groupby(['date1', 'date2'],as_index=False).agg({'numbers':'sum'})

... giving:

       date1       date2  numbers
0 2018-01-01  2018-12-31       35
1 2018-01-02  2018-12-31       52
2 2018-01-03  2018-12-31      104
3 2018-01-04  2018-12-31       96
4 2018-01-05  2018-12-31      151

Upvotes: 0

Related Questions