Carlo Allocca
Carlo Allocca

Reputation: 639

Selecting different portions of a dataframe

I would like to retrieve the different portions of the below dataframe based on the condition LabelId=1. In other words, given the following dataframe:

DF_input:

   eventTime                 velocity     LabelId
1  2017-08-19 12:53:55.050         3        0
2  2017-08-19 12:53:55.100         4        1
3  2017-08-19 12:53:55.150       180        1
4  2017-08-19 12:53:55.200         2        1
5  2017-08-19 12:53:55.250         5        0
6  2017-08-19 12:53:55.050         3        0
7  2017-08-19 12:53:55.100         4        1
8  2017-08-19 12:53:55.150        70        1
9  2017-08-19 12:53:55.200         2        1
10 2017-08-19 12:53:55.250         5        0

DF_output1

   eventTime                 velocity     LabelId 
2  2017-08-19 12:53:55.100         4        1
3  2017-08-19 12:53:55.150       180        1
4  2017-08-19 12:53:55.200         2        1

DF_output_2

eventTime                 velocity     LabelId
7  2017-08-19 12:53:55.100         4        1
8  2017-08-19 12:53:55.150        70        1
9  2017-08-19 12:53:55.200         2        1

My attempt was using the condition DF_input["LabelId"] == 1 but it returns all the rows in one dataframe. So I cannot distinguish between the two subset.

Upvotes: 1

Views: 61

Answers (4)

Paul H
Paul H

Reputation: 68126

You don't need loops, just some tricky logic with cumulative sums:

from io import StringIO

import numpy
import pandas

data = StringIO("""\
eventTime                 velocity     LabelId
2017-08-19 12:53:55.050         3        0
2017-08-19 12:53:55.100         4        1
2017-08-19 12:53:55.150       180        1
2017-08-19 12:53:55.200         2        1
2017-08-19 12:53:55.250         5        0
2017-08-19 12:53:55.050         3        0
2017-08-19 12:53:55.100         4        1
2017-08-19 12:53:55.150        70        1
2017-08-19 12:53:55.200         2        1
2017-08-19 12:53:55.250         5        0
""")

df = (
    pandas.read_table(data, sep='\s\s+')
        .assign(diff=lambda df: df['LabelId'].diff())
        .assign(group=lambda df: numpy.where(
            (df['diff'] == 1).cumsum() == (df['diff'].shift(-1) == -1).shift(1).cumsum(),
            0,
            (df['diff'] == 1).cumsum()
        ))
        .query("group > 0")
        .drop(columns='diff')
)

And then with e.g.,

print(df[df['group'] == 1])

you get:

                 eventTime  velocity  LabelId  group
1  2017-08-19 12:53:55.100         4        1      1
2  2017-08-19 12:53:55.150       180        1      1
3  2017-08-19 12:53:55.200         2        1      1

Upvotes: 0

jpp
jpp

Reputation: 164623

This is one way, but a bit messy.

from itertools import groupby
import numpy as np

acc = np.cumsum([len(list(g)) for k, g in groupby(df['LabelId'])])

i = [(a, b) for a, b in zip(acc, acc[1:])][::2]

dfs = [df.iloc[m:n, :] for m, n in i]

# [   velocity  LabelId
# 1         4        1
# 2       180        1
# 3         2        1,
#     velocity  LabelId
# 6         4        1
# 7        70        1
# 8         2        1]

Upvotes: 0

lorenzori
lorenzori

Reputation: 747

if not a big dataframe you can do something simple like:

portion = []
for ix, row in df.iterrows():
    if df.loc[ix - 1, 'Labelid']==0 and row['Labelid']==0: 
        portion.append(row)
    else:
        # do stuff on the portion
    portion = []

Upvotes: 1

BENY
BENY

Reputation: 323226

Something like

l=[ None if df1[df1.LabelId==1].empty  else df1[df1.LabelId==1] for _, df1 in df.groupby(df.LabelId.eq(0).cumsum())]
l
Out[402]: 
[                eventTime  velocity  LabelId
 2  2017-08-1912:53:55.100         4        1
 3  2017-08-1912:53:55.150       180        1
 4  2017-08-1912:53:55.200         2        1,
 None,
                 eventTime  velocity  LabelId
 7  2017-08-1912:53:55.100         4        1
 8  2017-08-1912:53:55.150        70        1
 9  2017-08-1912:53:55.200         2        1,
 None]

New group key detail

df.LabelId.eq(0).cumsum()
Out[398]: 
1     1
2     1
3     1
4     1
5     2
6     3
7     3
8     3
9     3
10    4
Name: LabelId, dtype: int32

Upvotes: 3

Related Questions