Arun

Reputation: 669

Operator chaining in pandas dataframe

Is it possible to combine columns when chaining comparison operators in pandas? For example,

df2 = df[(df['A'] < 4) & (df['B'] >= 4) & (df['C'] >= 4)]

Here, B and C share the same condition, i.e. both columns should have a value >= 4. When I rewrite it as

df2 = df[(df['A'] < 4) & (df['B','C'] >= 4)]

I get an error. Is there an efficient way to write this chained condition?

Thanks in Advance.

AP

Upvotes: 0

Views: 596

Answers (1)

unutbu

Reputation: 879729

You could select multiple columns by indexing with a list of column names, and then use all to combine the results:

df2 = df[(df['A'] < 4) & (df[['B','C']] >= 4).all(axis='columns')]

Note the double brackets in df[['B','C']]. This returns a sub-DataFrame of df with columns B and C. Although together they might look like some kind of special double-bracket syntax, there is nothing special about them: evaluation follows normal Python rules, and the inner and outer brackets simply have different meanings. The outer brackets indicate that we are indexing df. The inner brackets form the list ['B','C']. Together they cause Python to call df.__getitem__(['B','C']).
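To make the equivalence concrete, here is a minimal sketch that runs both forms on a small DataFrame. The sample values are hypothetical (the question does not show its data); only the two filter expressions come from the posts above.

```python
import pandas as pd

# Hypothetical sample data; the question does not show its DataFrame.
df = pd.DataFrame({
    'A': [1, 2, 5, 3],
    'B': [4, 3, 6, 7],
    'C': [5, 6, 4, 2],
})

# Original form from the question: one comparison per column.
explicit = df[(df['A'] < 4) & (df['B'] >= 4) & (df['C'] >= 4)]

# Compact form: compare B and C in one step (this gives a boolean DataFrame),
# then reduce it to one boolean per row with all(axis='columns').
compact = df[(df['A'] < 4) & (df[['B', 'C']] >= 4).all(axis='columns')]

# Both forms select the same rows; with this data only row 0 passes.
print(compact)
print(explicit.equals(compact))
```

The all(axis='columns') call is what turns the two-column boolean result back into a single row mask, which is why it can be combined with the condition on A using &.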


Why df['B','C'] raises a KeyError:

df['B','C'] is equivalent to df[('B','C')], which has a very different meaning from df[['B','C']]. When indexing a DataFrame, Pandas interprets the tuple ('B','C') as a single column label. This is particularly useful for DataFrames with MultiIndexed columns, where it selects the column whose first column level equals B and whose second column level equals C. Since your DataFrame has neither a MultiIndexed column index nor a single column with the (peculiar) name ('B','C'), a KeyError is raised when you evaluate df['B','C'].
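The difference between the two keys can be checked directly. The following is a small sketch, assuming a DataFrame with plain columns A, B and C; the ShowKey class and the sample values are hypothetical helpers for illustration and are not part of pandas.

```python
import pandas as pd

class ShowKey:
    """Hypothetical helper: returns whatever key Python passes to __getitem__."""
    def __getitem__(self, key):
        return key

probe = ShowKey()
tuple_key = probe['B', 'C']    # Python packs this into the tuple ('B', 'C')
list_key = probe[['B', 'C']]   # here the key is the list ['B', 'C']
print(type(tuple_key).__name__, type(list_key).__name__)  # tuple list

# Hypothetical data with ordinary single-level column labels.
df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [6, 7]})

# The tuple is looked up as ONE column label, which does not exist here.
try:
    df['B', 'C']
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# The list is treated as a collection of labels and selects two columns.
sub = df[['B', 'C']]
print(list(sub.columns))  # ['B', 'C']
```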


An example of a single-indexed DataFrame where df['B','C'] doesn't raise a KeyError:

In [15]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=[('A','B'),('B','C'),('B','D')])

In [16]: df
Out[16]: 
   (A, B)  (B, C)  (B, D)
0       5       2       1
1       5       5       3
2       8       8       1
3       9       2       9
4       3       5       8

In [17]: df['B','C']
Out[17]: 
0    2
1    5
2    8
3    2
4    5
Name: (B, C), dtype: int64

An example of a MultiIndexed DataFrame where df['B','C'] doesn't raise a KeyError:

In [20]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=pd.MultiIndex.from_tuples([('A','B'),('B','C'),('C','D')]))

In [21]: df
Out[21]: 
   A  B  C
   B  C  D
0  6  1  1
1  5  1  0
2  5  7  8
3  6  9  9
4  5  5  0

In [22]: df['B','C']
Out[22]: 
0    1
1    1
2    7
3    9
4    5
Name: (B, C), dtype: int64

Upvotes: 2
