John Taylor
John Taylor

Reputation: 737

Pandas pd.cut bins with lower bound issue

sample data I am trying to bin by 5 year intervals starting with the year 1980 using this code for pd.cut

bins = list(range(1980, 2025, 4))    
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, include_lowest=True)

results in this dataframe with a less than ideal starting value for the 1980 row's bin:

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      **(1979.999,** 1984.0]
1981    1.0    34.0    1202     (1979.999, 1984.0]
1982    2.0    35.0    1428     (1979.999, 1984.0]
1983    2.0    37.0    2374     (1979.999, 1984.0]
1984    2.0    46.0    2890     (1979.999, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]

And, removing the include_lowest=True bit, results in this with no bin at all for 1980:

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      NaN
1981    1.0    34.0    1202     (1980.0, 1984.0]
1982    2.0    35.0    1428     (1980.0, 1984.0]
1983    2.0    37.0    2374     (1980.0, 1984.0]
1984    2.0    46.0    2890     (1980.0, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]

So, the quiz question here is, how to use pd.cut to get this ideal result:

index   col1   col2    col3     bin_col                         
1980    1.0    30.0    980      **(1980.0, 1984.0]**
1981    1.0    34.0    1202     (1980.0, 1984.0]
1982    2.0    35.0    1428     (1980.0, 1984.0]
1983    2.0    37.0    2374     (1980.0, 1984.0]
1984    2.0    46.0    2890     (1980.0, 1984.0]
1985    3.0    63.0    4011     (1984.0, 1988.0]

I followed the documentation and several examples and the above code was the best outcome. I am about to start manually converting the bin column values to strings and editing the '1979.999' part to read '1980' so the bins make sense to humans. But, there has to be a better way. Hence, my question.

Upvotes: 0

Views: 692

Answers (1)

kaihami
kaihami

Reputation: 815

Well this is a little bit tricky,

But you can use, labels.

labels = ['(%d, %d]'%(bins[i], bins[i+1]) for i in range(len(bins)-1)]
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, labels=labels, include_lowest=True)

Upvotes: 2

Related Questions