Reputation: 596
I have a dataframe df, and want to select only the columns that have a specific value "xyz" in row [0].
I know if I were trying to select rows with a specific value "xyz" in column [0] I would just do:
df[df[0] == 'xyz']
Now I try:
df[df.loc[0] == 'xyz']
and I get error:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
Why won't the same operation work for selecting columns with a condition? And what's the easiest solution to get just the columns I want?
Upvotes: 0
Views: 3389
Reputation: 30971
Assume that df contains:
Aa Bb Cc Dd
0 xyz abc xyz efg
1 e1 e2 e3 e4
2 f1 f2 f3 f4
To generate a bool vector (actually a Series) checking whether consecutive cells in row 0 == 'xyz', you can run:
df.iloc[0] == 'xyz'
The result is:
Aa True
Bb False
Cc True
Dd False
Name: 0, dtype: bool
To get the names of columns of interest, run:
df.columns[df.iloc[0] == 'xyz']
The result is:
Index(['Aa', 'Cc'], dtype='object')
So to get your expected result, retrieve df[...] with the above expression between brackets:
df[df.columns[df.iloc[0] == 'xyz']]
The result is:
Aa Cc
0 xyz xyz
1 e1 e3
2 f1 f3
Your first code sample - df[0] == 'xyz'
failed because df[0]
attempts to retrieve a column with name 0.
Since your DataFrame doesn't contain such column, a KeyError: 0
exception is raised.
The second code sample - df[df.loc[0] == 'xyz']
is a better approach.
Note that df.loc[0] == 'xyz'
is just what I proposed.
The reason of failure is that you now attempt something like boolean indexing on columns, but in this case it does not work. Between brackets you can pass a list of column names, not a list of "boolean indicators".
Upvotes: 2
Reputation: 12493
Here's a way to do that. I'm using dummy data.
df = pd.DataFrame({"x": pd.date_range("2020-01-01", periods = 10, freq = "D"),
"y": range(10),
"z": range(10, 20),
"w": range(10, 30, 2)
})
print(df)
==>
x y z w
0 2020-01-01 0 10 10
1 2020-01-02 1 11 12
2 2020-01-03 2 12 14
3 2020-01-04 3 13 16
4 2020-01-05 4 14 18
...
I would now like to get only the columns that have '10' in the first row.
cols = (df.iloc[0] == 10)
new_df = df.loc[:, cols[cols].index]
print(new_df)
==>
z w
0 10 10
1 11 12
2 12 14
3 13 16
4 14 18
...
Upvotes: 0