Reputation: 521
I have a large dataframe of the following structure which is simplified for the purpose of this question:
A B C D ... J K
date1 time1 1 1 ... 1 1
date2 time2 2 2 ... 2 2
Essentially the first 3 columns all represent string data, i.e. times and dates. I am reading the dataframe in from a CSV file, and the data after the times has multiple points on a single day for each column.
What I wish to do is find a way to group all the data by date and create mean values per day which account for multiple datapoints on a single day. This is handled nicely by groupby(dates); however, I lose all of the other data not included in the dates column, so the result is like so:
Before groupby().mean():
A B C D ... J K
date1 time1 1 1 ... 1 1
date2 time2 2 2 ... 2 2
date2 time3 1 1 ... 1 1
After:
A C D ... J K
date1 1 1 ... 1 1
date2 1.5 1.5 ... 1.5 1.5
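The column-dropping behaviour above can be reproduced with a minimal sketch (pandas assumed; the column names and values are illustrative, matching the simplified tables):

```python
import pandas as pd

# A holds dates, B holds times, the rest are numeric readings
df = pd.DataFrame({
    'A': ['date1', 'date2', 'date2'],
    'B': ['time1', 'time2', 'time3'],
    'C': [1, 2, 1],
    'D': [1, 2, 1],
})

# A plain mean() keeps only the numeric columns, so B disappears
out = df.groupby('A', as_index=False).mean(numeric_only=True)
print(out)
# B is gone; date2's numeric columns average to 1.5
```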
My ideal output would be to keep the data in the time column and date column whilst still creating mean values based on the days. This would result in the following:
Ideal output:
A B C D ... J K
date1 time1 1 1 ... 1 1
date2 time2 1.5 1.5 ... 1.5 1.5
Upvotes: 0
Views: 817
Reputation: 863556
The problem is that you need to aggregate the columns with strings and times too, e.g. by first, else they are omitted. So a possible solution is to create a dict of aggregation functions and use groupby + agg + reset_index + reindex_axis:
print (df)
A B C D E J K
0 a date1 time1 1 1 1 1
1 b date2 time2 2 2 2 2
2 c date2 time3 1 1 1 1
cols = ['A','B','C']
d = {x:'mean' for x in df.columns.difference(cols)}
d['A'] = 'first'
d['C'] = 'first'
print (d)
{'E': 'mean', 'D': 'mean', 'J': 'mean', 'A': 'first', 'C': 'first', 'K': 'mean'}
df1 = df.groupby('B').agg(d).reset_index().reindex_axis(df.columns, axis=1)
print (df1)
A B C D E J K
0 a date1 time1 1.0 1.0 1.0 1.0
1 b date2 time2 1.5 1.5 1.5 1.5
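Note that reindex_axis was removed in later pandas versions; the same approach works with reindex(columns=...) instead. A sketch against the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': ['date1', 'date2', 'date2'],
    'C': ['time1', 'time2', 'time3'],
    'D': [1, 2, 1],
    'E': [1, 2, 1],
    'J': [1, 2, 1],
    'K': [1, 2, 1],
})

# mean for the numeric columns, first for the string columns
cols = ['A', 'B', 'C']
d = {x: 'mean' for x in df.columns.difference(cols)}
d['A'] = 'first'
d['C'] = 'first'

# reindex(columns=...) restores the original column order,
# replacing the removed reindex_axis(..., axis=1)
df1 = df.groupby('B').agg(d).reset_index().reindex(columns=df.columns)
print(df1)
```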
Upvotes: 1