mlx

Reputation: 514

How to find the time difference between the last row of a group and the first row of the next group

I have a table that looks like this:

| date                | activity |
|---------------------|----------|
| 2017-03-30 01:00:00 | 1        |
| 2017-03-30 01:00:30 | 1        |
| 2017-03-30 01:01:00 | 1        |
| 2017-03-30 01:01:30 | 2        |
| 2017-03-30 01:02:00 | 2        |
| 2017-03-30 01:02:30 | 2        |
| 2017-03-30 01:03:00 | 1        |

My end goal is to derive some statistics for activity 1 from this time series. To do so, I need to start by constructing a list that summarizes the activity information.

Basically, I'd like to get a tuple for each chunk (block of successive rows labeled 1), where the tuple contains the start date of the chunk as well as the total number of rows in it. For the previous example, the corresponding list would be:

[(2017-03-30 01:00:00,3),(2017-03-30 01:03:00,1)]

Any idea about how this could be achieved? I'd also be open to other suggestions that would yield the same information in the end.
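For reproducibility, here is the sample data as a pandas DataFrame (dates kept as plain strings; pd.to_datetime can parse them if needed):

import pandas as pd

df = pd.DataFrame({
    'date': ['2017-03-30 01:00:00', '2017-03-30 01:00:30',
             '2017-03-30 01:01:00', '2017-03-30 01:01:30',
             '2017-03-30 01:02:00', '2017-03-30 01:02:30',
             '2017-03-30 01:03:00'],
    'activity': [1, 1, 1, 2, 2, 2, 1],
})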

Upvotes: 1

Views: 399

Answers (3)

KenHBS

Reputation: 7164

In a bit more digestible, but probably less pandas-elegant form:

  • First you create some identifier to distinguish between different 'runs' of activity equal to 1
  • Then you remove all records where activity is not equal to 1
  • Now the hardest logic is done and we can just use a straightforward groupby

If you then want to get the list of tuples you are looking for, you can use .iterrows() at the end:

# give each run of consecutive identical activity values its own id
df['id'] = (df['activity'].shift(1) != df['activity']).cumsum()

# keep only the rows where activity is 1
inds = df['activity'] == 1
df = df.loc[inds, :]

# per run: number of rows and earliest date
result = df.groupby('id')['date'].agg(['size', 'min'])
result
# id size   min
#  1    3   2017-03-30 01:00:00
#  3    1   2017-03-30 01:03:00

For the list of tuples, you can then do:

[(row[1]['min'], row[1]['size']) for row in result.iterrows()]
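For the sample data, this gives:

# [('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]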

Upvotes: 1

ALollz

Reputation: 59519

Finding groups of consecutive equal values is accomplished with a shift comparison plus cumsum. Use where to ignore the groups you don't care about.

# df = df.sort_values('date')  # only needed if rows aren't already in time order

# run id: increments every time the activity value changes
s = df.activity.ne(df.activity.shift(1)).cumsum()
# keep runs where activity == 1; take first date and size per run
res = df.groupby(s.where(df.activity.eq(1)).rename(None)).date.agg(['first', 'size'])

Output:

                   first  size
1.0  2017-03-30 01:00:00     3
3.0  2017-03-30 01:03:00     1
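To see why this works, here is what the run id and the masked grouper look like for the sample data:

s.tolist()
# [1, 1, 1, 2, 2, 2, 3]

s.where(df.activity.eq(1)).tolist()
# [1.0, 1.0, 1.0, nan, nan, nan, 3.0]

The activity-2 rows become NaN, so groupby drops them automatically.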

If you really want the list of tuples, then:

[tuple(x) for x in res.to_numpy()]
#[('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]
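Note that the dates in the tuples are strings only because the sample date column holds strings; parsing the column first yields Timestamps instead:

df['date'] = pd.to_datetime(df['date'])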

Upvotes: 2

rpanai

Reputation: 13437

You can first assign a number to each group and then use groupby. The first part is not that pythonic but it works:

import pandas as pd

data = {'date': {0: '2017-03-30 01:00:00',
                 1: '2017-03-30 01:00:30',
                 2: '2017-03-30 01:01:00',
                 3: '2017-03-30 01:01:30',
                 4: '2017-03-30 01:02:00',
                 5: '2017-03-30 01:02:30',
                 6: '2017-03-30 01:03:00'},
        'activity': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1}}

df = pd.DataFrame(data)

# assign a run number to each block of consecutive identical activity values
group = 0
groups = []
initial_value = df.iloc[0]["activity"]
for _, row in df.iterrows():
    if row["activity"] != initial_value:
        initial_value = row["activity"]
        group += 1
    groups.append(group)

df["group"] = groups

# min date and row count per run
out = df.groupby(["group", "activity"])\
        .agg({"date": ["min", "count"]})

# flatten the MultiIndex columns to date_min / date_count
out.columns = ["_".join(o) for o in out.columns]
out = out.reset_index()
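If you want to avoid the explicit loop, the run number can also be computed with the same shift/cumsum trick used in the other answers, and the result restricted to activity 1. A sketch (assuming pandas >= 0.25, which introduced named aggregation):

df["group"] = (df["activity"] != df["activity"].shift(1)).cumsum()

out = (df[df["activity"] == 1]
       .groupby("group")
       .agg(date_min=("date", "min"), date_count=("date", "count"))
       .reset_index())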

Upvotes: 0
