Reputation: 5239
I would like to have a unique DateTimeIndex
in a dataframe. Therefore, I would like to merge two rows with the same index into one row. During this merge, I would like to apply a custom formular (such as avg/mean
, max
, min
) to each column.
Idea formulated in data:
A Dataframe with times series data
df_orig
+----------------------------+-------+--------+
| | value | value2 |
| ts | | |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.711111 | 10 | 10 |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.766666 | 1 | 3 |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.766666 | 2 | 6 |
+----------------------------+-------+--------+
| 2019-09-17T09:39:39.010101 | 2 | 4 |
+----------------------------+-------+--------+
i would try to mark those rows, which index is not unique. Like this:
df_marked_duplicated
+----------------------------+-------+--------+-----------+
| | value | value2 | is_unique |
| ts | | | |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.711111 | 10 | 10 | 1 |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 1 | 3 | 0 |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 2 | 6 | 0 |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:39.010101 | 2 | 4 | 1 |
+----------------------------+-------+--------+-----------+
so that I can later calculate the avg
on value
and the max
on value2
. So the target df looks like:
df_target
+----------------------------+-------+--------+-----------+
| | value | value2 | is_unique |
| ts | | | |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.711111 | 10 | 10 | 1 |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 1.5 | 6 | 0 |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:39.010101 | 2 | 4 | 1 |
+----------------------------+-------+--------+-----------+
I am not sure if pd.groupby
isnt a bit overkill, because - from what I see - it forces me into iterating over it later. I am also interested into seeing how deduplication works in pandas more closely.
Thanks!
Upvotes: 0
Views: 191
Reputation: 1244
Pandas implementation of aggregate
functions on groupby
objects is performance efficient than any custom implementation. You can also apply different aggregate methods for different columns with dict
of functions.
so in your case
df.groupby(['ts'].agg({'value': 'mean', 'value2': 'max'})
Upvotes: 0
Reputation: 150785
Isn't it just groupby
:
df.groupby('ts').agg({'value':'mean', 'value2':'max'})
Upvotes: 1