Christopher

Reputation: 2232

Classifying Data in a New Column

I have the following df:

Column 1
1
2435
3345
104
505
6005
10000
80000
100000
4000000
4440
520
...

This structure is not well suited to plotting a histogram, which is my main goal. Automatic bins don't really solve the problem either, at least from what I've tested so far. That's why I would like to create my own bins in a new column.

I basically want to assign every value in Column 1 that falls within a certain range to a bucket in Column2, so that it looks like this:

Column 1    Column2
1           < 10000
2435        < 10000
3345        < 10000  
104         < 10000
505         < 10000
6005        < 10000
10000       < 50000
80000       < 150000
100000      < 150000
4000000     < 250000
4440        < 10000
520         < 10000
...

Once I get there, creating a plot will be much easier.

Thanks!

Upvotes: 1

Views: 57

Answers (2)

EdChum

Reputation: 393893

pandas has a function for this, cut, and the pandas documentation has a section describing it. cut returns the half-open interval (open on the left, closed on the right) that each value falls into:

In [29]:    
df['bin'] = pd.cut(df['Column 1'], bins=[0, 10000, 50000, 150000, 25000000])
df

Out[29]:

    Column 1                 bin
0          1          (0, 10000]
1       2435          (0, 10000]
2       3345          (0, 10000]
3        104          (0, 10000]
4        505          (0, 10000]
5       6005          (0, 10000]
6      10000          (0, 10000]
7      80000     (50000, 150000]
8     100000     (50000, 150000]
9    4000000  (150000, 25000000]
10      4440          (0, 10000]
11       520          (0, 10000]

The resulting column has the category dtype and can be used for filtering, counting, plotting, etc.
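
If you want the bucket names from the question rather than interval notation, cut also accepts a labels argument. The following is only a sketch: the sample values come from the question, the edges reuse the ones above, and the label strings and the choice of right=False (so that 10000 lands in "< 50000", as in the desired output) are my assumptions.

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({"Column 1": [1, 2435, 3345, 104, 505, 6005, 10000,
                                80000, 100000, 4000000, 4440, 520]})

# Same edges as above; the label strings are an assumption for illustration.
edges = [0, 10000, 50000, 150000, 25000000]
labels = ["< 10000", "< 50000", "< 150000", "< 25000000"]

# right=False makes each interval closed on the left and open on the right,
# so a value equal to an edge goes into the higher bucket.
df["Column2"] = pd.cut(df["Column 1"], bins=edges, labels=labels, right=False)

# Count rows per bucket, sorted by bucket order rather than by frequency.
bucket_counts = df["Column2"].value_counts().sort_index()
print(df)
print(bucket_counts)
```

bucket_counts is an ordinary Series, so something like bucket_counts.plot(kind='bar') should give the bar chart directly.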

Upvotes: 2

Ami Tavory

Reputation: 76297

numpy.histogram takes a bins parameter, which can be an array of bin edges, and returns the counts within those bins along with the edges. So, if you run

import numpy as np

counts, _ = np.histogram(df['Column 1'].values, [10000, 50000, 150000, 250000])

counts will hold the number of values between each pair of consecutive edges (values outside the outermost edges are not counted). From here, you can do whatever you want, including plotting the counts for each bin:

import matplotlib.pyplot as plt
plt.plot(counts)
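
As a rough check on the sample data from the question, the sketch below compares the edges used above with a wider set. The wider edges (a lower edge of 0 and an upper edge of 25000000) are an assumption, chosen only so that every sample value falls inside some bin.

```python
import numpy as np

# Sample values copied from the question.
values = np.array([1, 2435, 3345, 104, 505, 6005, 10000,
                   80000, 100000, 4000000, 4440, 520])

# Edges as written above: anything below 10000 or above 250000 is ignored.
narrow_counts, _ = np.histogram(values, bins=[10000, 50000, 150000, 250000])
ignored = len(values) - narrow_counts.sum()

# Assumed wider edges so that all sample values are covered.
edges = [0, 10000, 50000, 150000, 25000000]
counts, _ = np.histogram(values, bins=edges)

# Each bin is closed on the left and open on the right, except the last one,
# which also includes its right edge.
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo} to {hi}: {n}")
print("values ignored by the narrow edges:", ignored)
```

With only a handful of bins, a bar chart of counts against the bin labels is usually easier to read than a line plot.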

Upvotes: 1
