Reputation: 737
sample data I am trying to bin by 5 year intervals starting with the year 1980 using this code for pd.cut
bins = list(range(1980, 2025, 4))
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, include_lowest=True)
results in this dataframe with a less than ideal starting value for the 1980 row's bin:
index col1 col2 col3 bin_col
1980 1.0 30.0 980 **(1979.999,** 1984.0]
1981 1.0 34.0 1202 (1979.999, 1984.0]
1982 2.0 35.0 1428 (1979.999, 1984.0]
1983 2.0 37.0 2374 (1979.999, 1984.0]
1984 2.0 46.0 2890 (1979.999, 1984.0]
1985 3.0 63.0 4011 (1984.0, 1988.0]
And, removing the include_lowest=True bit, results in this with no bin at all for 1980:
index col1 col2 col3 bin_col
1980 1.0 30.0 980 NaN
1981 1.0 34.0 1202 (1980.0, 1984.0]
1982 2.0 35.0 1428 (1980.0, 1984.0]
1983 2.0 37.0 2374 (1980.0, 1984.0]
1984 2.0 46.0 2890 (1980.0, 1984.0]
1985 3.0 63.0 4011 (1984.0, 1988.0]
So, the quiz question here is, how to use pd.cut to get this ideal result:
index col1 col2 col3 bin_col
1980 1.0 30.0 980 **(1980.0, 1984.0]**
1981 1.0 34.0 1202 (1980.0, 1984.0]
1982 2.0 35.0 1428 (1980.0, 1984.0]
1983 2.0 37.0 2374 (1980.0, 1984.0]
1984 2.0 46.0 2890 (1980.0, 1984.0]
1985 3.0 63.0 4011 (1984.0, 1988.0]
I followed the documentation and several examples and the above code was the best outcome. I am about to start manually converting the bin column values to strings and editing the '1979.999' part to read '1980' so the bins make sense to humans. But, there has to be a better way. Hence, my question.
Upvotes: 0
Views: 692
Reputation: 815
Well this is a little bit tricky,
But you can use, labels.
labels = ['(%d, %d]'%(bins[i], bins[i+1]) for i in range(len(bins)-1)]
final_usage_data['bins'] = pd.cut(final_usage_data.index, bins=bins, labels=labels, include_lowest=True)
Upvotes: 2