Reputation: 639
I would like to retrieve the different portions of the below dataframe based on the condition LabelId=1. In other words, given the following dataframe:
DF_input:
eventTime velocity LabelId
1 2017-08-19 12:53:55.050 3 0
2 2017-08-19 12:53:55.100 4 1
3 2017-08-19 12:53:55.150 180 1
4 2017-08-19 12:53:55.200 2 1
5 2017-08-19 12:53:55.250 5 0
6 2017-08-19 12:53:55.050 3 0
7 2017-08-19 12:53:55.100 4 1
8 2017-08-19 12:53:55.150 70 1
9 2017-08-19 12:53:55.200 2 1
10 2017-08-19 12:53:55.250 5 0
DF_output1
eventTime velocity LabelId
2 2017-08-19 12:53:55.100 4 1
3 2017-08-19 12:53:55.150 180 1
4 2017-08-19 12:53:55.200 2 1
DF_output_2
eventTime velocity LabelId
7 2017-08-19 12:53:55.100 4 1
8 2017-08-19 12:53:55.150 70 1
9 2017-08-19 12:53:55.200 2 1
My attempt was using the condition DF_input["LabelId"] == 1 but it returns all the rows in one dataframe. So I cannot distinguish between the two subset.
Upvotes: 1
Views: 61
Reputation: 68126
You don't need loops, just some tricky logic with cumulative sums:
from io import StringIO
import numpy
import pandas
data = StringIO("""\
eventTime velocity LabelId
2017-08-19 12:53:55.050 3 0
2017-08-19 12:53:55.100 4 1
2017-08-19 12:53:55.150 180 1
2017-08-19 12:53:55.200 2 1
2017-08-19 12:53:55.250 5 0
2017-08-19 12:53:55.050 3 0
2017-08-19 12:53:55.100 4 1
2017-08-19 12:53:55.150 70 1
2017-08-19 12:53:55.200 2 1
2017-08-19 12:53:55.250 5 0
""")
df = (
pandas.read_table(data, sep='\s\s+')
.assign(diff=lambda df: df['LabelId'].diff())
.assign(group=lambda df: numpy.where(
(df['diff'] == 1).cumsum() == (df['diff'].shift(-1) == -1).shift(1).cumsum(),
0,
(df['diff'] == 1).cumsum()
))
.query("group > 0")
.drop(columns='diff')
)
And then with e.g.,
print(df[df['group'] == 1])
you get:
eventTime velocity LabelId group
1 2017-08-19 12:53:55.100 4 1 1
2 2017-08-19 12:53:55.150 180 1 1
3 2017-08-19 12:53:55.200 2 1 1
Upvotes: 0
Reputation: 164623
This is one way, but a bit messy.
from itertools import groupby
import numpy as np
acc = np.cumsum([len(list(g)) for k, g in groupby(df['LabelId'])])
i = [(a, b) for a, b in zip(acc, acc[1:])][::2]
dfs = [df.iloc[m:n, :] for m, n in i]
# [ velocity LabelId
# 1 4 1
# 2 180 1
# 3 2 1,
# velocity LabelId
# 6 4 1
# 7 70 1
# 8 2 1]
Upvotes: 0
Reputation: 747
if not a big dataframe you can do something simple like:
portion = []
for ix, row in df.iterrows():
if df.loc[ix - 1, 'Labelid']==0 and row['Labelid']==0:
portion.append(row)
else:
# do stuff on the portion
portion = []
Upvotes: 1
Reputation: 323226
Something like
l=[ None if df1[df1.LabelId==1].empty else df1[df1.LabelId==1] for _, df1 in df.groupby(df.LabelId.eq(0).cumsum())]
l
Out[402]:
[ eventTime velocity LabelId
2 2017-08-1912:53:55.100 4 1
3 2017-08-1912:53:55.150 180 1
4 2017-08-1912:53:55.200 2 1,
None,
eventTime velocity LabelId
7 2017-08-1912:53:55.100 4 1
8 2017-08-1912:53:55.150 70 1
9 2017-08-1912:53:55.200 2 1,
None]
New group key detail
df.LabelId.eq(0).cumsum()
Out[398]:
1 1
2 1
3 1
4 1
5 2
6 3
7 3
8 3
9 3
10 4
Name: LabelId, dtype: int32
Upvotes: 3