Reputation: 669
Is it possible to combine columns when chaining operators in pandas? For example,
df2 = df[(df['A'] < 4) & (df['B'] >= 4) & (df['C'] >= 4)]
Here, B and C share the same condition, i.e. both columns should have a value >= 4. When I rewrite it as,
df2 = df[(df['A'] < 4) & (df['B','C'] >= 4)]
I get an error. Is there an efficient way to write this operator chaining?
Thanks in Advance.
AP
Upvotes: 0
Views: 596
Reputation: 879729
You could select multiple columns by indexing with a list of column names, and then use the all method to combine the results:
df2 = df[(df['A'] < 4) & (df[['B','C']] >= 4).all(axis='columns')]
Note the double brackets in df[['B','C']]. This returns a sub-DataFrame of df with columns B and C.
Although together it might look like some kind of special double-bracket syntax, it isn't really special: evaluation follows normal Python rules, and the inner and outer brackets simply have different meanings. The outer brackets indicate that we are indexing df. The inner brackets form the list ['B','C']. Together they cause Python to call df.__getitem__(['B','C']).
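To make this concrete, here is a minimal sketch on made-up data (the values are hypothetical; only the column names A, B and C come from the question). It checks that the compact form selects the same rows as the fully chained condition, and that the list index is just an ordinary __getitem__ call:

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 5, 0],
    "B": [4, 5, 3, 6, 9],
    "C": [7, 2, 8, 4, 4],
})

# Original condition, with B and C spelled out separately.
chained = df[(df["A"] < 4) & (df["B"] >= 4) & (df["C"] >= 4)]

# Compact form: compare both columns at once, then require that
# every column in each row passes the test.
compact = df[(df["A"] < 4) & (df[["B", "C"]] >= 4).all(axis="columns")]

# Indexing with a list is a plain __getitem__ call returning a sub-DataFrame.
sub = df.__getitem__(["B", "C"])

print(chained.equals(compact))  # True
print(list(compact.index))      # [0, 4]
print(type(sub).__name__)       # DataFrame
```

With only two columns the saving is small; the list form mostly pays off when many columns share the same condition.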
Why does df['B','C'] raise a KeyError?
df['B','C'] is equivalent to df[('B','C')], which has a very different meaning than df[['B','C']]. When indexing a DataFrame, pandas interprets the tuple ('B','C') as a single column label. This is particularly useful for DataFrames with MultiIndexed columns: in that case it selects the column whose first column level equals B and whose second column level equals C. Since your DataFrame has neither a MultiIndexed column index nor a single column with the (peculiar) name ('B','C'), a KeyError is raised when you evaluate df['B','C'].
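A small sketch of the difference, again on hypothetical data with ordinary string column names: the tuple key is looked up as one label and fails, while the list key selects two columns.

```python
import pandas as pd

# Hypothetical frame with plain string column labels.
df = pd.DataFrame({"A": [1, 2], "B": [4, 5], "C": [7, 2]})

# df["B", "C"] passes the single key ("B", "C") to __getitem__.
try:
    df["B", "C"]
    raised = False
except KeyError:
    # No column is labelled ("B", "C"), so the lookup fails.
    raised = True

# A list key asks for several columns and succeeds.
both = df[["B", "C"]]

print(raised)              # True
print(list(both.columns))  # ['B', 'C']
```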
An example of a single-indexed DataFrame where df['B','C'] doesn't raise a KeyError:
In [15]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=[('A','B'),('B','C'),('B','D')])
In [16]: df
Out[16]:
   (A, B)  (B, C)  (B, D)
0       5       2       1
1       5       5       3
2       8       8       1
3       9       2       9
4       3       5       8
In [17]: df['B','C']
Out[17]:
0 2
1 5
2 8
3 2
4 5
Name: (B, C), dtype: int64
An example of a MultiIndexed DataFrame where df['B','C'] doesn't raise a KeyError:
In [20]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=pd.MultiIndex.from_tuples([('A','B'),('B','C'),('C','D')]))
In [21]: df
Out[21]:
   A  B  C
   B  C  D
0  6  1  1
1  5  1  0
2  5  7  8
3  6  9  9
4  5  5  0
In [22]: df['B','C']
Out[22]:
0 1
1 1
2 7
3 9
4 5
Name: (B, C), dtype: int64
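If you want to reproduce both cases side by side, it may be safer to build the two kinds of column index explicitly, since how a bare list of tuples is interpreted could depend on your pandas version. The following is a sketch under that assumption: the labels and the np.arange data are made up, and it relies on the tupleize_cols argument of pd.Index to keep each tuple as a single flat label.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and deterministic data for illustration.
labels = [("A", "B"), ("B", "C"), ("B", "D")]
data = np.arange(15).reshape(5, 3)

# Flat (single-level) Index whose labels happen to be tuples.
flat_cols = pd.Index(labels, tupleize_cols=False)
flat = pd.DataFrame(data, columns=flat_cols)

# Two-level MultiIndex built from the same tuples.
multi = pd.DataFrame(data, columns=pd.MultiIndex.from_tuples(labels))

# In both frames the tuple ("B", "C") identifies exactly one column.
flat_series = flat["B", "C"]
multi_series = multi["B", "C"]

print(flat.columns.nlevels, multi.columns.nlevels)  # 1 2
print(flat_series.tolist())                         # [1, 4, 7, 10, 13]
print(multi_series.tolist())                        # [1, 4, 7, 10, 13]
```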
Upvotes: 2