Reputation: 15217
For example I have simple DF:
import pandas as pd
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
'B': [randint(1, 9)*10 for x in range(10)],
'C': [randint(1, 9)*100 for x in range(10)]})
Can I select values from 'A' for which corresponding values for 'B' will be greater than 50, and for 'C' - not equal to 900, using methods and idioms of Pandas?
Upvotes: 372
Views: 966530
Reputation: 353479
Sure! Setup:
>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
'B': [randint(1, 9)*10 for x in range(10)],
'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
A B C
0 9 40 300
1 9 70 700
2 5 70 900
3 8 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
We can apply column operations and get boolean Series objects:
>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B
>>> (df["B"] > 50) & (df["C"] != 900)
or
>>> (df["B"] > 50) & ~(df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
[Update, to switch to new-style .loc
]:
And then we can use these to index into the object. For read access, you can chain indices:
>>> df["A"][(df["B"] > 50) & (df["C"] != 900)]
2 5
3 8
Name: A, dtype: int64
but you can get yourself into trouble because of the difference between a view and a copy doing this for write access. You can use .loc
instead:
>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"] *= 1000
>>> df
A B C
0 9 40 300
1 9 70 700
2 5000 70 900
3 8000 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
Upvotes: 541
Reputation: 23421
It may be more readable to assign each condition to a variable, especially if there are a lot of them (maybe with descriptive names) and chain them together using bitwise operators such as (&
or |
). As a bonus, you don't need to worry about brackets ()
because each condition evaluates independently.
m1 = df['B'] > 50
m2 = df['C'] != 900
m3 = df['C'].pow(2) > 1000
m4 = df['B'].mul(4).between(50, 500)
# filter rows where all of the conditions are True
df[m1 & m2 & m3 & m4]
# filter rows of column A where all of the conditions are True
df.loc[m1 & m2 & m3 & m4, 'A']
or put the conditions in a list and reduce it via bitwise_and
from numpy
(wrapper for &
).
conditions = [
df['B'] > 50,
df['C'] != 900,
df['C'].pow(2) > 1000,
df['B'].mul(4).between(50, 500)
]
# filter rows of A where all of conditions are True
df.loc[np.bitwise_and.reduce(conditions), 'A']
Upvotes: 3
Reputation: 139
You can use pandas it has some built in functions for comparison. So if you want to select values of "A" that are met by the conditions of "B" and "C" (assuming you want back a DataFrame pandas object)
df[['A']][df.B.gt(50) & df.C.ne(900)]
df[['A']]
will give you back column A in DataFrame format.
pandas gt
function will return the positions of column B that are greater than 50 and ne
will return the positions not equal to 900.
Upvotes: 8
Reputation: 15058
And remember to use parenthesis!
Keep in mind that &
operator takes a precedence over operators such as >
or <
etc. That is why
4 < 5 & 6 > 4
evaluates to False
. Therefore if you're using pd.loc
, you need to put brackets around your logical statements, otherwise you get an error. That's why do:
df.loc[(df['A'] > 10) & (df['B'] < 15)]
instead of
df.loc[df['A'] > 10 & df['B'] < 15]
which would result in
TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
Upvotes: 59
Reputation: 2913
Another solution is to use the query method:
import pandas as pd
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],
'B': [randint(1, 9) * 10 for x in xrange(10)],
'C': [randint(1, 9) * 100 for x in xrange(10)]})
print df
A B C
0 7 20 300
1 7 80 700
2 4 90 100
3 4 30 900
4 7 80 200
5 7 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
print df.query('B > 50 and C != 900')
A B C
1 7 80 700
2 4 90 100
4 7 80 200
5 7 60 800
Now if you want to change the returned values in column A you can save their index:
my_query_index = df.query('B > 50 & C != 900').index
....and use .iloc
to change them i.e:
df.iloc[my_query_index, 0] = 5000
print df
A B C
0 7 20 300
1 5000 80 700
2 5000 90 100
3 4 30 900
4 5000 80 200
5 5000 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
Upvotes: 84