user3014653

Reputation: 765

How do I iteratively select rows in pandas based on column values?

I'm a complete newbie at pandas so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if I can understand the answer better.

If I have the following DataFrame:

A    B    C 
0    0    1
0    1    1

I want to loop through columns "A", "B" and "C" in that order. During each iteration I want to select all the rows for which the current column is 1 and none of the previous columns are 1, save the result, and use it in the next iteration.

So when looking at column A, I wouldn't select anything. Then when looking at column B I would select the second row because B==1 and A==0. Then when looking at column C I would select the first row because C==1, A==0 and B==0.

Upvotes: 0

Views: 113

Answers (2)

Corralien

Reputation: 120409

Create a boolean mask:

m = (df == 1) & (df.cumsum(axis=1) == 1)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}

Output:

>>> m
       A      B      C
0  False  False   True
1  False   True  False
2  False  False   True

>>> d
{'B': [1], 'C': [0, 2]}

I slightly modified your DataFrame by adding a third row:

>>> df
   A  B  C
0  0  0  1
1  0  1  1
2  0  0  1
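
To see why the mask works, here is a minimal self-contained sketch that rebuilds the sample above and prints the intermediate row-wise cumulative sum. It assumes the columns hold only 0 and 1; with other values the `cumsum(axis=1) == 1` test would no longer mean "first 1 in this row".

```python
import pandas as pd

# Sample from this answer (the question's frame plus a third row)
df = pd.DataFrame({"A": [0, 0, 0], "B": [0, 1, 0], "C": [1, 1, 1]})

# Running total across each row, left to right (A, then A+B, then A+B+C)
running = df.cumsum(axis=1)

# A cell is the first 1 in its row when it equals 1 and the running total is still 1
m = (df == 1) & (running == 1)

# Row labels selected per column, skipping columns that select nothing
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].any()}

print(running)
print(m)
print(d)
```

`m[col].any()` is used here in place of `m[col].sum()`; both skip columns with no selected rows.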

Update

For the expected output on my sample:

for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])

Output:

=== B ===
   A  B  C
1  0  1  1

=== C ===
   A  B  C
0  0  0  1
2  0  0  1
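
Since the question asks for a plain loop that carries each step's result into the next iteration, the same selection can also be sketched with brute force. The names `seen` and `selected` are made up for this illustration, and the sketch again assumes the columns hold only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0], "B": [0, 1, 0], "C": [1, 1, 1]})

# True for rows already selected by an earlier column
seen = pd.Series(False, index=df.index)
selected = {}

for col in ["A", "B", "C"]:
    # Rows where this column is 1 and no earlier column selected the row
    current = (df[col] == 1) & ~seen
    selected[col] = df[current]
    # Carry the result forward so later columns skip these rows
    seen |= current

for col, rows in selected.items():
    print(f"\n=== {col} ===")
    print(rows)
```

A row has a 1 in some previous column exactly when an earlier iteration already selected it, so tracking `seen` is equivalent to re-checking every previous column.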

Upvotes: 1

rafaelc

Reputation: 59274

It seems like you can use idxmax directly. From the documentation:

Return index of first occurrence of maximum over requested axis.

NA/null values are excluded.


>>> df.idxmax()
A    0
B    1
C    0
dtype: int64

The values above are the index labels of the first row in each column where the value is 1. 1 for B means that the second row was "selected"; 0 for C means that the first row was. The only issue is that, for a column with no 1 at all (such as A), idxmax also returns 0, because every value ties for the maximum.

To address that, you can use where:

>>> df.idxmax().where(~df.eq(0).all())

This returns NaN for all-zero columns:

A    NaN
B    1.0
C    0.0
dtype: float64
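
For reference, a minimal self-contained sketch of the steps above on the question's two-row frame. The variable names `first` and `result` are chosen here for illustration only.

```python
import pandas as pd

# The two-row frame from the question
df = pd.DataFrame({"A": [0, 0], "B": [0, 1], "C": [1, 1]})

# Label of the first row holding each column's maximum
first = df.idxmax()

# Keep the label only for columns that are not entirely zero
result = first.where(~df.eq(0).all())

print(first)
print(result)
```

Note that idxmax reports a single row label per column and does not itself look at the earlier columns, so it fits best when one row per column is all that is needed.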

Upvotes: 0
