Mitch
Mitch

Reputation: 596

How to select column based on condition in row?

I have a dataframe df, and want to select only the columns that have a specific value "xyz" in row [0].

I know if I were trying to select rows with a specific value "xyz" in column [0] I would just do:

df[df[0] == 'xyz']

Now I try:

df[df.loc[0] == 'xyz']

and I get error:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

Why won't the same operation work for selecting columns with a condition? And what's the easiest solution to get just the columns I want?

Upvotes: 0

Views: 3389

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 30971

Assume that df contains:

    Aa   Bb   Cc   Dd
0  xyz  abc  xyz  efg
1   e1   e2   e3   e4
2   f1   f2   f3   f4

To generate a bool vector (actually a Series) checking whether consecutive cells in row 0 == 'xyz', you can run:

df.iloc[0] == 'xyz'

The result is:

Aa     True
Bb    False
Cc     True
Dd    False
Name: 0, dtype: bool

To get the names of columns of interest, run:

df.columns[df.iloc[0] == 'xyz']

The result is:

Index(['Aa', 'Cc'], dtype='object')

So to get your expected result, retrieve df[...] with the above expression between brackets:

df[df.columns[df.iloc[0] == 'xyz']]

The result is:

    Aa   Cc
0  xyz  xyz
1   e1   e3
2   f1   f3

A comment concerning your code samples

Your first code sample - df[0] == 'xyz' failed because df[0] attempts to retrieve a column with name 0. Since your DataFrame doesn't contain such column, a KeyError: 0 exception is raised.

The second code sample - df[df.loc[0] == 'xyz'] is a better approach. Note that df.loc[0] == 'xyz' is just what I proposed.

The reason of failure is that you now attempt something like boolean indexing on columns, but in this case it does not work. Between brackets you can pass a list of column names, not a list of "boolean indicators".

Upvotes: 2

Roy2012
Roy2012

Reputation: 12493

Here's a way to do that. I'm using dummy data.

df = pd.DataFrame({"x": pd.date_range("2020-01-01", periods = 10, freq = "D"), 
                  "y": range(10), 
                  "z": range(10, 20), 
                   "w": range(10, 30, 2)
                  })
                 
print(df)

==>
           x  y   z   w
0 2020-01-01  0  10  10
1 2020-01-02  1  11  12
2 2020-01-03  2  12  14
3 2020-01-04  3  13  16
4 2020-01-05  4  14  18
...

I would now like to get only the columns that have '10' in the first row.

cols = (df.iloc[0] == 10)

new_df = df.loc[:, cols[cols].index]
print(new_df)


   ==>
        z   w
    0  10  10
    1  11  12
    2  12  14
    3  13  16
    4  14  18
    ...

Upvotes: 0

Related Questions