Python: how to split one variable according to bins from another variable?

Question

I want to create an histogram of variable V based on the bins of variable X. To do this, I read an Excel file which looks like this:

Column X   Column V
99.9       0
100.0      3
25.17      2
39.45      1
66.52      1
17.17      6
9.25       2
86.11      3
84.09      3

For each bin of the variable X I want to compute the average of the V values associated with it. For example:

X bin: 0-30 -> avg(V)=(2+6+2)/3=3.33
X bin: 31-80 -> avg(V)=(1+1)/2=1.00
X bin: 81-100 -> avg(V)=(3+3+0+3)/4=2.25

So that I come up with:

X bin    avg(V)
0-30     3.33
31-80    1.00
81-100   2.25

To do this I have written the following code block, in which I use some lists to gather all the V values which fall within each X bins (binwidth=10).

EDIT

I have a problem with the lengths of my lists. With an excel file of 1000 rows, for instance, where there is only 1 V value belonging to the bin 41-50. However, if I enter len(islands_4150)=999. Where does the code get the other 998 values?

from openpyxl import load_workbook
wb = load_workbook(filename = 'myfile.xlsx')
ws=wb.active
cell_range_1 = ws['X2':'X1001']
cell_range_2 = ws['V2':'V1001']
cf_list=[] #List with X values
island_list=[] #List with V values
for row in range(2,1001): 
    for column in 'X':
        cell_name_1="{}{}".format(column, row) #X
        cf_list.append(ws[cell_name_1].value)
        x=map(lambda x: int(x) if x%1==0 else x, cf_list)
        for column in 'V':
            cell_name_2="{}{}".format(column, row) #V
            island_list.append(ws[cell_name_2].value)
            v=map(lambda x: int(x) if x%1==0 else x, island_list)
islands_010=[] #List with values from column V which corresponding values from column X are 0<=value<=10
islands_1120=[]
islands_2130=[]
islands_3140=[]
islands_4150=[]
islands_5160=[]
islands_6170=[]
islands_7180=[]
islands_8190=[]
islands_91100=[]
for i, val in enumerate(x):
    for j, elem in enumerate(v):
        if x[i]>=0 and x[i]<=10:
            islands_010.append(v[i])
        elif x[i]>=11 and x[i]<=20:
            islands_1120.append(v[i])
        elif x[i]>=21 and x[i]<=30:
            islands_2130.append(v[i])
        elif x[i]>=31 and x[i]<=40:
            islands_3140.append(v[i])
        elif x[i]>=41 and x[i]<=50:
            islands_4150.append(v[i])
        elif x[i]>=51 and x[i]<=60:
            islands_5160.append(v[i])
        elif x[i]>=61 and x[i]<=70:
            islands_6170.append(v[i])
        elif x[i]>=71 and x[i]<=80:
            islands_7180.append(v[i])
        elif x[i]>=81 and x[i]<=90:
            islands_8190.append(v[i])
        elif x[i]>=91 and x[i]<=100:
            islands_91100.append(v[i])

if len(islands_010)==0:
    print ('Avg islands 0-10: 0') 
else:
    avg010=round(reduce(lambda x, y: x + y, islands_010) / len(islands_010),3)
    print ('Avg islands 0-10: '+str(avg010))

if len(islands_1120)==0:
    print ('Avg islands 11-20: 0') 
else:
    avg1120=round(reduce(lambda x, y: x + y, islands_1120) / len(islands_1120),3)
    print ('Avg islands 11-20: '+str(avg1120))

if len(islands_2130)==0:
    print ('Avg islands 21-30: 0')
else:
    avg2130=round(reduce(lambda x, y: x + y, islands_2130) / len(islands_2130),3)
    print ('Avg islands 21-30: '+str(avg2130))

if len(islands_3140)==0:
    print ('Avg islands 31-40: 0')
else:
    avg3140=round(reduce(lambda x, y: x + y, islands_3140) / len(islands_3140),3)
    print ('Avg islands 31-40: '+str(avg3140))

if len(islands_4150)==0:
    print ('Avg islands 41-50: 0')
else:
    avg4150=round(reduce(lambda x, y: x + y, islands_4150) / len(islands_4150),3)
    print ('Avg islands 41-50: '+str(avg4150))

if len(islands_5160)==0:
    print ('Avg islands 51-60: 0')
else:
    avg5160=round(reduce(lambda x, y: x + y, islands_5160) / len(islands_5160),3)
    print ('Avg islands 51-60: '+str(avg5160))

if len(islands_6170)==0:
    print ('Avg islands 61-70: 0')
else:
    avg6170=round(reduce(lambda x, y: x + y, islands_6170) / len(islands_6170),3)
    print ('Avg islands 61-70: '+str(avg6170))

if len(islands_7180)==0:
    print ('Avg islands 71-80: 0')
else:
    avg7180=round(reduce(lambda x, y: x + y, islands_7180) / len(islands_7180),3)
    print ('Avg islands 71-80: '+str(avg7180))

if len(islands_8190)==0:
    print ('Avg islands 81-90: 0')
else:
    avg8190=round(reduce(lambda x, y: x + y, islands_8190) / len(islands_8190),3)
    print ('Avg islands 81-90: '+str(avg8190))

if len(islands_91100)==0:
    print ('Avg islands 91-100: 0')
else:
    avg91100=round(reduce(lambda x, y: x + y, islands_91100) / len(islands_91100),3)
    print ('Avg islands 91-100: '+str(avg91100))

wnnmaw · Accepted Answer

Your code is, as it stands, rather poorly constructed, and this is clouding the problem.

The first problem is whitespace. You need some.

The next is the with the lines for column in 'X': and for column in 'V':. These two for loops are useless and they can be replaced with:

cell_name_1="X{}".format(row) #X variable
cell_name_2="V{}".format(row) #V variable

Furthermore, I would recommend grabbing the cell values, and then doing all of your comparisons:

x_val =  float(ws[cell_name_1].value)
v_val =  int(ws[cell_name_2].value)

Ranges in python are inclusive on the first number, and exclusive on the last. Therefore the range in your first loop should go to 1002 such that the last row is 1001.

for row in range(2, 1002):

I would recommend using ws = WB.get_sheet_by_name("sheet_name") to retrieve the sheet, rather than ws=wb.active to ensure you consistently get the sheet you want.

Finally, we get to the actual problem. Your current method is reading directly from excel straight into bins. What you should do is read all the data from excel, and then manipulate it to produce the bins you want. The first step is to get the data into a python structure which makes your life easiest, I recommend a list of tuples:

islands.append((x_val,v_val))

This will produce something like this:

[(99.9, 0), (100.0, 3), (25.17, 2), (39.45, 1), (66.52, 1), (17.17, 6), (9.25, 2), (86.11, 3), (84.09, 3)]

Now we should sort the data by the column X value:

islands.sort(key = lambda x: x[0])

To produce:

[(9.25, 2), (17.17, 6), (25.17, 2), (39.45, 1), (66.52, 1), (84.09, 3), (86.11, 3), (99.9, 0), (100.0, 3)]

Now that our data is sorted, we can easily produce a dictionary of values defined by the max value of each bin:

bins = [30, 80, 100]
binned_data = {key: [] for key in bins}
for item in islands:
    for bin in bins:
        if item[0] <= bin:
            binned_data[bin].apppend(item[1])
            break

This results in a dictionary like so:

{80: [1, 1], 100: [0, 3, 3, 3], 30: [2, 6, 2]}

From here you can trivially compute the averages with

averages = {bin: sum(binned_data[bin])/float(len(binned_data[bin])) for bin in binned_data}

Putting it all together:

from openpyxl import load_workbook

wb = load_workbook(filename = 'myfile.xlsx')
ws = wb.get_sheet_by_name("sheet_name")

islands = []

for row in range(2,1002):
    cell_name_1="X{}".format(row) #X variable
    cell_name_2="V{}".format(row) #V variable

    x_val =  float(ws[cell_name_1].value)
    v_val =  int(ws[cell_name_2].value)

    islands.append((x_val,v_val))

islands.sort(key = lambda x: x[0])

bins = [30, 80, 100]
binned_data = {key: [] for key in bins}

for item in islands:
    for bin in bins:
        if item[0] <= bin:
            binned_data[bin].apppend(item[1])
            break

averages = {bin: sum(binned_data[bin])/float(len(binned_data[bin])) for bin in binned_data}

Python: how to split one variable according to bins from another variable?

Answers (1)

Related Questions