DEB
DEB

Reputation: 241

How to create new values in a pandas dataframe column based on values from another column

I have a pandas dataframe of values I read in from a csv file. I have a column labeled 'SleepQuality' and the values are float from 0.0 - 100.0. I want to create a new column labeled 'SleepQualityGroup' where values from the original column btw 0 - 49 have a value of 0 in the new column, 50 - 59 = 1 , 60 - 69 = 2, 70 - 79 = 3, 80 - 89 = 4, and 90 - 100 = 5

What would be the best formula to use in order to do this? I am stuck on the logic needed to identify all values in each range and assign to the new value.

An example of what the output would like like below in the new 'SleepQualityGroup' column.

SleepQuality    SleepQualityGroup
80.4              4
90.1              5
66.4              2
50.3              1
86.2              4
75.4              3
45.7              0
91.5              5
61.3              2 
54                1
58.2              1

Upvotes: 7

Views: 830

Answers (2)

Divakar
Divakar

Reputation: 221514

That's basically a binning operation. As such two tools could be used here.

Using np.searchsorted -

bins = np.arange(50,100,10)
df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)

Using np.digitize -

df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)

Sample output -

In [866]: df
Out[866]: 
    SleepQuality  SleepQualityGroup
0           80.4                  4
1           90.1                  5
2           66.4                  2
3           50.3                  1
4           86.2                  4
5           75.4                  3
6           45.7                  0
7           91.5                  5
8           61.3                  2
9           54.0                  1
10          58.2                  1

Runtime test -

In [921]: df
Out[921]: 
    SleepQuality  SleepQualityGroup
0           80.4                  4
1           90.1                  5
2           66.4                  2
3           50.3                  1
4           86.2                  4
5           75.4                  3
6           45.7                  0
7           91.5                  5
8           61.3                  2
9           54.0                  1
10          58.2                  1

In [922]: df = pd.concat([df]*10000,axis=0)

# @Dark's soln using pd.cut
In [923]: %timeit df['new'] = pd.cut(df['SleepQuality'],bins=[0,50 , 60, 70 , 80 , 90,100], labels=[0,1,2,3,4,5])
1000 loops, best of 3: 1.04 ms per loop

In [926]: %timeit df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
1000 loops, best of 3: 591 µs per loop

In [927]: %timeit df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
1000 loops, best of 3: 538 µs per loop

Upvotes: 6

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

Use pd.cut i.e

df['new'] = pd.cut(df['SleepQuality'],bins=[0,50 , 60, 70 , 80 , 90,100], labels=[0,1,2,3,4,5])

Output:

        SleepQuality  SleepQualityGroup new
0           80.4                  4   4
1           90.1                  5   5
2           66.4                  2   2
3           50.3                  1   1
4           86.2                  4   4
5           75.4                  3   3
6           45.7                  0   0
7           91.5                  5   5
8           61.3                  2   2
9           54.0                  1   1
10          58.2                  1   1

Upvotes: 12

Related Questions