Reputation: 17631
I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np
data = {"first_column": ["item1", "item2", "item3", "item4", "item5", "item6", "item7"],
"second_column": ["cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"],
"third_column": [5, 1, 8, 3, 731, 189, 9]}
df = pd.DataFrame(data)
df
first_column second_column third_column
0 item1 cat1 5
1 item2 cat1 1
2 item3 cat1 8
3 item4 cat2 3
4 item5 cat2 731
5 item6 cat2 189
6 item7 cat2 9
Now, let's say I wanted to create a fourth column showing the classification of the third column using pandas.cut()
. Here, I label each row whether the element in third_column
is less than or equal to ten, <=10
.
df["less_than_ten"]= pd.cut(df.third_column, [-np.inf, 10, np.inf], labels=(1,0))
And the resulting dataframe is now:
first_column second_column third_column less_than_ten
0 item1 cat1 5 1
1 item2 cat1 1 1
2 item3 cat1 8 1
3 item4 cat2 3 1
4 item5 cat2 731 0
5 item6 cat2 189 0
6 item7 cat2 9 1
Question: Notice the second column second_column
, with categories cat1
and cat2
. How would I use pandas.cut()
to reclassify these values based on the "class" in second_column
?
More importantly, let's say I wanted more complex intervals, e.g. less or equal to 500 le(500) and greater than or equal to 20 ge(20)? How would this be done? In this case, there should be a 1 labeled by grouping:
first_column second_column third_column less_than_ten
0 item1 cat1 5 1
1 item2 cat1 1 1
2 item3 cat1 8 1
3 item4 cat2 3 1
4 item5 cat2 731 0
5 item6 cat2 189 1
6 item7 cat2 9 1
Upvotes: 2
Views: 4956
Reputation: 57033
While fully understanding that my proposed solution looks like a hack and gives numbers that are different from yours, I still offer it here:
df['less_than_ten'] = (df.second_column=='cat1').astype(int) +\
(df.third_column<10).astype(int)
# first_column second_column third_column less_than_ten
#0 item1 cat1 5 2
#1 item2 cat1 1 2
#2 item3 cat1 8 2
#3 item4 cat2 3 1
#4 item5 cat2 731 0
#5 item6 cat2 189 0
#6 item7 cat2 9 1
You can make the conditions in the parentheses arbitrary complex, as long as they are boolean (True
/False
).
Upvotes: 1
Reputation: 210832
I wouldn't use pd.cut
in this case:
df['less_than_ten'] = df.third_column.le(10).astype(np.uint8)
df.loc[df.second_column=='cat2','less_than_ten'] = \
df.loc[df.second_column=='cat2','third_column'].le(10).astype(np.uint8) + 2
Result:
In [99]: df
Out[99]:
first_column second_column third_column less_than_ten
0 item1 cat1 5 1
1 item2 cat1 1 1
2 item3 cat1 8 1
3 item4 cat2 3 3
4 item5 cat2 731 2
5 item6 cat2 189 2
6 item7 cat2 9 3
Upvotes: 1