gies0r
gies0r

Reputation: 5239

Merge rows in one dataframe which share the same index

I would like to have a unique DateTimeIndex in a dataframe. Therefore, I would like to merge two rows with the same index into one row. During this merge, I would like to apply a custom formular (such as avg/mean, max, min) to each column.

Idea formulated in data:

A Dataframe with times series data

df_orig
+----------------------------+-------+--------+
|                            | value | value2 |
| ts                         |       |        |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.711111 | 10    | 10     |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.766666 | 1     | 3      |
+----------------------------+-------+--------+
| 2019-09-17T09:39:38.766666 | 2     | 6      |
+----------------------------+-------+--------+
| 2019-09-17T09:39:39.010101 | 2     | 4      |
+----------------------------+-------+--------+

i would try to mark those rows, which index is not unique. Like this:

df_marked_duplicated
+----------------------------+-------+--------+-----------+
|                            | value | value2 | is_unique |
| ts                         |       |        |           |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.711111 | 10    | 10     | 1         |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 1     | 3      | 0         |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 2     | 6      | 0         |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:39.010101 | 2     | 4      | 1         |
+----------------------------+-------+--------+-----------+

so that I can later calculate the avg on valueand the max on value2. So the target df looks like:

 df_target
+----------------------------+-------+--------+-----------+
|                            | value | value2 | is_unique |
| ts                         |       |        |           |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.711111 | 10    | 10     | 1         |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:38.766666 | 1.5   | 6      | 0         |
+----------------------------+-------+--------+-----------+
| 2019-09-17T09:39:39.010101 | 2     | 4      | 1         |
+----------------------------+-------+--------+-----------+

I am not sure if pd.groupby isnt a bit overkill, because - from what I see - it forces me into iterating over it later. I am also interested into seeing how deduplication works in pandas more closely.

Thanks!

Upvotes: 0

Views: 191

Answers (2)

DOOM
DOOM

Reputation: 1244

Pandas implementation of aggregate functions on groupby objects is performance efficient than any custom implementation. You can also apply different aggregate methods for different columns with dict of functions.

so in your case


df.groupby(['ts'].agg({'value': 'mean', 'value2': 'max'})

Upvotes: 0

Quang Hoang
Quang Hoang

Reputation: 150785

Isn't it just groupby:

df.groupby('ts').agg({'value':'mean', 'value2':'max'})

Upvotes: 1

Related Questions