hk2
hk2

Reputation: 487

Creating bins based on condition

My original dataset looks like the sample below:

| id | old_a | new_a | old_b | new_b | ratio_a  | ratio_b |
|----|-------|-------|-------|-------|----------|---------|
| 1  | 350   | 6     | 35    | 0     | 58.33333 | Inf     |
| 2  | 164   | 79    | 6     | 2     | 2.075949 | 3       |
| 3  | 10    | 0     | 1     | 1     | Inf      | 1       |
| 4  | 120   | 1     | 10    | 0     | 120      | Inf     |

Here's the dataframe:

df=[[1,350,6,35,0],[2,164,79,6,2],[3,10,0,1,1],[4,120,1,10,0]]
df= pd.DataFrame(df,columns=['id','old_a','new_a','old_b','new_b'])

I have obtained the column 'ratio_a' and 'ratio_b' (as shown in the table) with the following code:

df['ratio_a']= df['old_a']/df['new_a']
df['ratio_b']= df['old_b']/df['new_b']

Next I wanted to create two more columns with number ranges in which the values of ratio_a and ratio_b would fall. For this purpose, I have written the following code:

bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)

One of the issue that I'm having is that if any value in ratio_a and ratio_b is greater than 100, it should fall in the bucket '>100'. How can I do that? My final result should look like one below:

| id | old_a | new_a | old_b | new_b | ratio_a  | ratio_b | a_range | b_range |
|----|-------|-------|-------|-------|----------|---------|---------|---------|
| 1  | 350   | 6     | 35    | 0     | 58.33333 | Inf     | 40-50   | NaN     |
| 2  | 164   | 79    | 6     | 2     | 2.075949 | 3       | 0-10    | 0-10    |
| 3  | 10    | 0     | 1     | 1     | Inf      | 1       | NaN     | 0-10    |
| 4  | 120   | 1     | 10    | 0     | 120      | Inf     | >100    | NaN     |

Upvotes: 0

Views: 871

Answers (1)

Mehdi Golzadeh
Mehdi Golzadeh

Reputation: 2583

One possible solution:

bins = [0,10,20,30,40,50,60,70,80,90,100,np.inf]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
labels[-1]=">100"
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)

the result:

id  old_a  new_a  old_b  new_b     ratio_a  ratio_b a_range b_range
 1    350      6     35      0   58.333333      inf   50-60     NaN
 2    164     79      6      2    2.075949      3.0    0-10    0-10
 3     10      0      1      1         inf      1.0     NaN    0-10
 4    120      1     10      0  120.000000      inf    >100     NaN

Upvotes: 1

Related Questions