Reputation: 514
I have a table that looks like this:
| date | activity |
|---------------------|----------|
| 2017-03-30 01:00:00 | 1 |
| 2017-03-30 01:00:30 | 1 |
| 2017-03-30 01:01:00 | 1 |
| 2017-03-30 01:01:30 | 2 |
| 2017-03-30 01:02:00 | 2 |
| 2017-03-30 01:02:30 | 2 |
| 2017-03-30 01:03:00 | 1 |
My end goal is to derive some statistics for activity 1 from this time-series. To do so, I need to start by constructing a list that would summarize the information about the activity.
Basically, I'd like to get a tuple for each chunk (block of successive rows labeled 1), where the tuple would contain the start date of the chunk as well as the total number of rows in it. For the previous example, the corresponding list would be:
[(2017-03-30 01:00:00,3),(2017-03-30 01:03:00,1)]
Any idea about how this could be achieved? I'd also be open to other suggestions that would yield the same information in the end.
Upvotes: 1
Views: 399
Reputation: 7164
In a bit more digestible, but probably less pandas-elegant form.
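This assumes the question's sample table is already loaded into a DataFrame `df` with string dates; a minimal setup reproducing it would be:

import pandas as pd

df = pd.DataFrame({
    'date': ['2017-03-30 01:00:00', '2017-03-30 01:00:30',
             '2017-03-30 01:01:00', '2017-03-30 01:01:30',
             '2017-03-30 01:02:00', '2017-03-30 01:02:30',
             '2017-03-30 01:03:00'],
    'activity': [1, 1, 1, 2, 2, 2, 1],
})

With `df` in place, tag each run of consecutive equal values with a running id, keep only the activity-1 rows, and aggregate per id: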
df['id'] = (df['activity'].shift(1) != df['activity']).cumsum()  # new id whenever activity changes
inds = df['activity'] == 1
df = df.loc[inds, :]  # keep only the activity-1 rows
result = df.groupby('id')['date'].agg(['min', 'size'])  # start date and row count per run
result
#                     min  size
# id
# 1   2017-03-30 01:00:00     3
# 3   2017-03-30 01:03:00     1
If you then want to get the list of tuples you are looking for, you can use `.iterrows()` at the end:
[(row[1]['min'], row[1]['size']) for row in result.iterrows()]
# [('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]
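A slightly more direct alternative, if you don't need `.iterrows()` specifically, is `.itertuples()` (standard pandas; shown here as an equivalent, assuming the `result` frame above):

list(result.itertuples(index=False, name=None))
# [('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]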
Upvotes: 1
Reputation: 59519
Finding groups with the same consecutive value is accomplished by `cumsum` + a `shift` comparison. Use `where` to ignore groups that you don't care about.
#df = df.sort_values('date')  # only needed if the rows aren't already in time order
s = df.activity.ne(df.activity.shift(1)).cumsum()  # run id: increments at every change
# rename(None) drops the Series name so it doesn't label the result's index
res = df.groupby(s.where(df.activity.eq(1)).rename(None)).date.agg(['first', 'size'])
Output:
                   first  size
1.0  2017-03-30 01:00:00     3
3.0  2017-03-30 01:03:00     1
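The `1.0` and `3.0` in the index are the surviving group keys. For the sample data, the grouping key works out as follows:

s = df.activity.ne(df.activity.shift(1)).cumsum()
# 0    1
# 1    1
# 2    1
# 3    2
# 4    2
# 5    2
# 6    3

# where() blanks out the runs you don't care about,
# and groupby drops the resulting NaN keys:
s.where(df.activity.eq(1))
# 0    1.0
# 1    1.0
# 2    1.0
# 3    NaN
# 4    NaN
# 5    NaN
# 6    3.0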
If you really want the list of tuples, then:
[tuple(x) for x in res.to_numpy()]
#[('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]
Upvotes: 2
Reputation: 13437
You can first assign a number to each group and then use `groupby`. The first part is not that pythonic, but it works:
import pandas as pd

data = {'date': {0: '2017-03-30 01:00:00',
                 1: '2017-03-30 01:00:30',
                 2: '2017-03-30 01:01:00',
                 3: '2017-03-30 01:01:30',
                 4: '2017-03-30 01:02:00',
                 5: '2017-03-30 01:02:30',
                 6: '2017-03-30 01:03:00'},
        'activity': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1}}
df = pd.DataFrame(data)
# assign a running group number that increments whenever activity changes
group = 0
groups = []
initial_value = df.iloc[0]["activity"]
for _, row in df.iterrows():
    if row["activity"] != initial_value:
        initial_value = row["activity"]
        group += 1
    groups.append(group)
df["group"] = groups

# min date and row count per (group, activity); a list (not a set) of
# aggregations keeps the column order deterministic
out = df.groupby(["group", "activity"]).agg({"date": ["min", "count"]})
out.columns = ["_".join(o) for o in out.columns]  # flatten to date_min, date_count
out = out.reset_index()
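`out` still contains the activity-2 group. To get exactly the tuple list from the question, one more step (a sketch using the `date_min`/`date_count` columns built above) filters to activity 1 and zips:

ones = out[out["activity"] == 1]
list(zip(ones["date_min"], ones["date_count"]))
# [('2017-03-30 01:00:00', 3), ('2017-03-30 01:03:00', 1)]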
Upvotes: 0