gustavz

Reputation: 3160

filter multi-indexed grouped pandas dataframe

The data looks like the following:

id  timestamp   date        value
1   2001-01-01  2001-05-01  0
1   2001-10-01  2001-05-01  1
2   2001-01-01  2001-05-01  0
2   2001-10-01  2001-05-01  0

As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies somewhere between the first and the last timestamp of each id.

The task is to filter the table so that every id is removed that does not have at least one entry with value > 0 at a timestamp after its individual date.

I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table where good is True.

Unfortunately this implementation is extremely slow for a big (>10M rows) dataset. I am looking for a way to speed this up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work and I do not know whether this is the fastest option possible.

Working Code Example:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'], 
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})                               

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id','timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for _, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)
    # slice from this id's date onward and check for any positive value
    df.loc[index, 'good'] = (df_id['value'].loc[df_id['date'].iloc[0]:] > 0).any()
df = df[df['good']]

Upvotes: 1

Views: 80

Answers (1)

jezrael

Reputation: 863186

To keep every id that has at least one row with value greater than 0 and a timestamp later than date, create two masks with Series.gt, chain them with & (bitwise AND), and then test whether each group contains at least one True with GroupBy.transform('any'):

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])

m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print(df)
   id  timestamp       date  value
0   1 2001-01-01 2001-01-05      0
1   1 2001-01-10 2001-01-05      1
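
For comparison, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same row-level mask in three ways. The explicit format='%m-%d-%Y' is an assumption that matches the month-first output shown above. The isin variant and the names good_ids, by_isin, by_filter and by_transform are illustrative and not part of the original answer. The GroupBy.filter variant is one way to write the groupby + lambda idea from the question; it calls a Python function once per group, so it is likely to be slower than the vectorized variants on many ids, although that has not been measured here.

```python
import pandas as pd

# Sample data copied from the question (month-first date strings).
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
    'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
    'value': [0, 1, 0, 0],
})
# Assumption: parse explicitly as month-day-year, matching the output above.
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m-%d-%Y')
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')

# Row-level mask: positive value recorded strictly after the id's date.
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])

# Approach from the answer: broadcast "any True in group" back to every row.
by_transform = df[m.groupby(df['id']).transform('any')]

# Illustrative variant: collect the qualifying ids, then filter with isin.
good_ids = df.loc[m, 'id'].unique()
by_isin = df[df['id'].isin(good_ids)]

# Illustrative variant: GroupBy.filter with a per-group Python callback.
by_filter = df.groupby('id').filter(
    lambda g: (g['value'].gt(0) & g['timestamp'].gt(g['date'])).any()
)

print(by_isin)
```

All three variants keep only the rows of id 1 on this sample. Note that the mask uses a strict comparison (timestamp later than date), whereas a label slice such as .loc[date:] also includes rows where timestamp equals date.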

Upvotes: 2
