Reputation: 45
I currently have a CSV file that I read into a DataFrame as follows:
[in]
import pandas as pd
df = pd.read_csv(file_name)
df = df.sort_values('TOTAL_MONTHS')
print(df[['TOTAL_MONTHS', 'COUNTEM']])
[out]
TOTAL_MONTHS  COUNTEM
          12        0
          12        0
          12        2
          25       10
          25        0
          37        1
          68        3
I want to get the total number of rows (by TOTAL_MONTHS) for which the 'COUNTEM' value falls within a preset bin.
The data is going to be entered into a histogram via Excel/PowerPoint with:
X-axis = Number of contracts
Y-axis = Total_months
Color of bar = COUNTEM
The input of the graph is like this (columns being COUNTEM bins):
MONTHS    0  1-3  4-6  7-10  10+  20+
0         0    0    0     0    0    0
1         0    0    0     0    0    0
2         0    0    0     0    0    0
3         0    0    0     0    0    0
...
12        2    1    0     0    0    0
...
25        1    0    0     0    1    0
...
37        0    1    0     0    0    0
...
68        0    1    0     0    0    0
Ideally I'd like the code to output a dataframe in that format.
Upvotes: 0
Views: 110
Reputation: 13965
Interesting problem. I don't know pandas thoroughly, so there may well be a fancier and simpler solution to this. However, it can also be done by iterating, in the following manner:
#First, imports and create your data
import pandas as pd
DF = pd.DataFrame({'TOTAL_MONTHS' : [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM' : [0, 0, 2, 10, 0, 1, 3]
                   })
#Next create a data frame of 'bins' with the months as index and all
#values set at a default of zero
New_DF = pd.DataFrame({'bin0' : 0,
                       'bin1' : 0,
                       'bin2' : 0,
                       'bin3' : 0,
                       'bin4' : 0,
                       'bin5' : 0},
                      index = DF.TOTAL_MONTHS.unique())
In [59]: New_DF
Out[59]:
    bin0  bin1  bin2  bin3  bin4  bin5
12     0     0     0     0     0     0
25     0     0     0     0     0     0
37     0     0     0     0     0     0
68     0     0     0     0     0     0
#Create a list of bins (rather than 20 to infinity I limited it to 100)
bins = [[0], range(1, 4), range(4, 7), range(7, 10), range(10, 20), range(20, 100)]
#Now iterate over the months of the New_DF index and slice the original
#DF where TOTAL_MONTHS equals the month of the current iteration. Then
#get a value count from the original data frame and add each count to
#the column of New_DF whose bin contains the value. Using += matters
#because several distinct COUNTEM values can fall in the same bin:
for month in New_DF.index:
    monthly = DF[DF['TOTAL_MONTHS'] == month]
    counts = monthly['COUNTEM'].value_counts()
    for count in counts.keys():
        for x in range(len(bins)):
            if count in bins[x]:
                New_DF.loc[month, New_DF.columns[x]] += counts[count]
Which gives me:
In [62]: New_DF
Out[62]:
    bin0  bin1  bin2  bin3  bin4  bin5
12     2     1     0     0     0     0
25     1     0     0     0     1     0
37     0     1     0     0     0     0
68     0     1     0     0     0     0
That appears to be what you want. You can rename the index and columns as you see fit.
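As a rough sketch of that renaming step: the label strings below and the idea of filling every month from 0 up to the largest one are my assumptions, based on the table in the question. The result frame is rebuilt by hand here so the snippet runs on its own.

```python
import pandas as pd

# The binned result from above, typed in by hand so this snippet is standalone
New_DF = pd.DataFrame({'bin0': [2, 1, 0, 0],
                       'bin1': [1, 0, 1, 1],
                       'bin2': [0, 0, 0, 0],
                       'bin3': [0, 0, 0, 0],
                       'bin4': [0, 1, 0, 0],
                       'bin5': [0, 0, 0, 0]},
                      index=[12, 25, 37, 68])

# Assumed labels: they follow the ranges used for the bins, so the
# question's "7-10" and "10+" become "7-9" and "10-19" to avoid overlap
New_DF.columns = ['0', '1-3', '4-6', '7-9', '10-19', '20+']

# Add an all-zero row for every month between 0 and the largest month seen
full = New_DF.reindex(range(0, New_DF.index.max() + 1), fill_value=0)
full.index.name = 'MONTHS'

print(full.loc[[0, 12, 25, 37, 68]])
```

The reindex only adds rows; the counts for the months that were already present are left unchanged.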
Hope this helps. Perhaps someone has a solution that uses a built-in pandas function, but for now this seems to work.
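For what it's worth, here is a sketch of how the built-in route might look, using pd.cut to assign each COUNTEM value to a bin and pd.crosstab to count rows per month and bin. The bin edges and label strings are my assumptions (chosen to match the ranges in the loop version, with an open-ended last bin), and it assumes COUNTEM is never negative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TOTAL_MONTHS': [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM': [0, 0, 2, 10, 0, 1, 3]})

# Left-closed, right-open edges: [0,1), [1,4), [4,7), [7,10), [10,20), [20,inf)
edges = [0, 1, 4, 7, 10, 20, np.inf]
labels = ['0', '1-3', '4-6', '7-9', '10-19', '20+']  # assumed label names

# Assign every COUNTEM value to one of the labelled bins
binned = pd.cut(df['COUNTEM'], bins=edges, labels=labels, right=False)

# Count rows per (month, bin); converting to plain strings and reindexing
# the columns makes sure bins with no rows still appear, filled with zeros
table = (pd.crosstab(df['TOTAL_MONTHS'], binned.astype(str))
           .reindex(columns=labels, fill_value=0))

print(table)
```

If the months with no rows are needed as well, a further `.reindex(range(0, df['TOTAL_MONTHS'].max() + 1), fill_value=0)` on the row index should add them as all-zero rows.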
Upvotes: 2