Manglu
Manglu

Reputation: 266

Extracting next rows in Pandas Dataframe based on column values

Suppose I have the following Dataframe

  final  raw  act    wc             Start            Finish
   abc  xyz   30    M5  17-01-2022 06:00  14-07-2031 02:36
   abc  xyz   40    F4  17-01-2022 06:00  14-07-2031 02:36
   abc  xyz   50    F6  17-01-2022 06:00  14-07-2031 02:36
   abc  xyz   60    F8  17-01-2022 06:00  14-07-2031 02:36
   abc  pqr   40  M14S  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   50   M12  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   60  M14S  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   20    F3  17-01-2022 06:00  14-07-2031 02:36
   abc  pqr   40    F4  17-01-2022 06:00  14-07-2031 02:36
   abc  pqr   50    F6  17-01-2022 06:00  14-07-2031 02:36

I would like to take the two rows from here, one is

 abc  xyz   50  F6  17-01-2022 06:00  14-07-2031 02:36

another one is

abc  pqr   50    F6  17-01-2022 06:00  14-07-2031 02:36

The logic would be that for each raw, pick up the next row where wc is either F3, F4 and the act is maximum. Here for xyz, F4 is there, so the next row also for pqr, both F3, F4 are there but maximum act is 40.

I did it using pd.shift()

dft = dfUno.loc[dfUno['wc'].shift().eq('F4')]

But I would like to see it in a more generic way, may be extracting using iterrows(). Like, my code is only true for F4. I want to extract the dataframe used for above F4/ F3 also.

Expected outcome for this:

 final  raw  act    wc             Start            Finish
   abc  xyz   30    M5  17-01-2022 06:00  14-07-2031 02:36
   abc  xyz   40    F4  17-01-2022 06:00  14-07-2031 02:36
   abc  pqr   40  M14S  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   50   M12  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   60  M14S  17-01-2022 06:00  18-01-2026 17:21
   abc  pqr   20    F3  17-01-2022 06:00  14-07-2031 02:36
   abc  pqr   40    F4  17-01-2022 06:00  14-07-2031 02:36

pls suggest something, how to do it.

Upvotes: 2

Views: 917

Answers (1)

Mustafa Aydın
Mustafa Aydın

Reputation: 18306

Solution:

ends = df[df.wc.isin(["F3", "F4"])].groupby("raw", sort=False).act.idxmax()

uniqs = ~df.raw.duplicated()
starts = uniqs[uniqs].index

out = df.iloc[np.r_[tuple(map(slice, starts, ends+1))]]

You can first subset the frame to only have rows with wc equal to "F3" or "F4". Then group by the raw column to see which index per group gives a maximum act. These give the "end" locations for the selections:

>>> ends = df[df.wc.isin(["F3", "F4"])].groupby("raw", sort=False).act.idxmax()
>>> ends
raw
xyz    1
pqr    8
Name: act, dtype: int64

Now we need the "start" positions. duplicated function can be used to detect where each group in raw column starts:

>>> uniqs = ~df.raw.duplicated()
>>> uniqs
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: raw, dtype: bool

# get the locations of True's
>>> starts = uniqs[uniqs].index
>>> starts
Int64Index([0, 4], dtype='int64')

Now all we need to do is generate a dynamic slicer for the groups. For this, we map the starts & ends with slice object constructor and then pass it to np.r_ as a tuple to keep them altogether:

# +1 so that we include endpoints
>>> slices = tuple(map(slice, starts, ends+1))
>>> slices
(slice(0, 2, None), slice(4, 9, None))

>>> rows_to_choose = np.r_[slices]
>>> rows_to_choose
array([0, 1, 4, 5, 6, 7, 8])

So the end result arises with iloc now

>>> df.iloc[rows_to_choose]
  final  raw  act    wc             Start            Finish
0   abc  xyz   30    M5  17-01-2022 06:00  14-07-2031 02:36
1   abc  xyz   40    F4  17-01-2022 06:00  14-07-2031 02:36
4   abc  pqr   40  M14S  17-01-2022 06:00  18-01-2026 17:21
5   abc  pqr   50   M12  17-01-2022 06:00  18-01-2026 17:21
6   abc  pqr   60  M14S  17-01-2022 06:00  18-01-2026 17:21
7   abc  pqr   20    F3  17-01-2022 06:00  14-07-2031 02:36
8   abc  pqr   40    F4  17-01-2022 06:00  14-07-2031 02:36

Note 1: The sort=False of groupby assures raw isn't sorted whilst grouping, otherwise we'd get the end position of the row with "pqr" first in ends since "pqr" < "xyz", and that'd conflict with starts' ordering.

Note 2: The above assumes you have a RangeIndex, i.e., 0..N-1 kind of index. If this is not the case, you can copy it before and then put it back, i.e.,

orig_index = df.index
new_df = df.reset_index(drop=True)

... # above operations with `new_df`

out.index = orig_index[out.index]

Upvotes: 1

Related Questions