homayoun

Reputation: 83

Using a matrix format in Python to calculate my own similarity score

I have a CSV file containing the values of commodities traded between countries, something like this:

Country  Comm  Value
 GER       1     200
 GER       2     300
 GER       45    354
 USA       2     100
 USA       85    500
 UK        2     240
 UK        85    900

I have created a matrix from this data. In the matrix, rows are countries, columns are commodity codes, and each element is the trade value. There are 97 commodities, and I used the following code to create the matrix:

rfile = open('file path','r')
rfile.next()
dic_c1_products = {}
for i in rfile :
    lns = i.strip().split(',')
    c1 = lns[0]
    p = lns[1]
    value= lns[2]
    if not dic_c1_products.has_key(c1):
        dic_c1_products[c1] = [(p,value),]
    else:
        dic_c1_products[c1].append((p,value))
product_count = 97
c1_list = dic_c1_products.keys()
matrix_c1_products = [[0 for col in range(int(product_count)+1)] for row in range(len(c1_list))]
for c1 in dic_c1_products:
    for p, v in dic_c1_products[c1]:
        matrix_c1_products[c1_list.index(c1)][int(p)] = int(v)
print 'Matrix Done'

Now I want to calculate an index score for each pair of countries (the pair score is the total trade in common commodities over the total trade of both countries). The matrix has a form like this:

Countries   Commodity1 Commodity2 Commodity45 Commodity85
 GER           200        300         45          0
 USA            0         100          0         500
 UK             0         240          0         900

First I want to sum the values of the SAME commodities that two countries both trade, and then divide this amount by the TOTAL trade of those two countries. For example, GER and USA both trade commodity number 2, so I want the sum of these common commodities (300+100) over the sum of the total trade of Germany and the United States: (first row: 200+300+354) + (second row: 100+500).

In simple words, for this matrix: first, I want to calculate the total values of the GER and USA rows. Second, I want to calculate the total value of the common commodities that both countries trade. Third, I want to divide the value from stage two by the value from stage one. To do this, I have written the following code:

for i in range(len(matrix_c1_products)):
    for j in range(i, len(matrix_c1_products)):
        dividend=sum([matrix_c1_products[i]])+sum([matrix_c1_products[j]])
        for k in matrix_c1_products[i]:
            for l in matrix_c1_products[j]:
              #  print k,l
                if int(k)==int(0):
                    pass
                if int(l)==int(0):
                    pass
                else:
                    commonone.append(k)
                    commontwo.append(l)
        divisor=sum(commonone)+sum(commontwo)
        shares=int(divisor/dividend)
        print shares, divisor, dividend

But there is a problem with the commonone list. I intend to drop the zeros from the two rows and add up the remaining values, but because of the nested loop the same number is appended repeatedly and the results are not correct. Any help would be appreciated.

Upvotes: 1

Views: 785

Answers (1)

Kasravnd

Reputation: 107347

A more Pythonic approach is to first build a dictionary of your rows, which can be done with the following dict comprehension:

chart_dict={i[0]:map(int,i[1:]) for i in spamreader}
{' USA': [0, 100, 0, 500], ' GER': [200, 300, 45, 0], ' UK': [0, 240, 0, 900]}
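
On Python 3, map returns a lazy iterator rather than a list, so the comprehension above would store map objects instead of the values. A minimal sketch of the same step for Python 3, assuming the matrix-form rows from the question with no header row (the inline sample text and the strip() on the country names are assumptions for illustration):

```python
import csv
import io

# Sample rows in the matrix layout from the question (assumed: no header row).
sample = "GER,200,300,45,0\nUSA,0,100,0,500\nUK,0,240,0,900\n"

rows = list(csv.reader(io.StringIO(sample)))

# Country name -> list of integer trade values, one per commodity column.
chart_dict = {row[0].strip(): [int(v) for v in row[1:]] for row in rows}

print(chart_dict)
```

Stripping the names avoids keys such as ' GER' with a leading space, which are easy to mistype in later lookups.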

Then create your pairs with itertools.combinations :

capirs= list(combinations(next(z),2))
[(' GER', ' USA'), (' GER', ' UK'), (' USA', ' UK')]
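
As a small self-contained illustration (Python 3, with a hard-coded country list assumed for the example), combinations yields each unordered pair exactly once and never pairs a country with itself. The question's loop, by contrast, starts j at i and so also compares each row with itself.

```python
from itertools import combinations

# Country names as they would come out of the first column (assumed sample).
countries = ["GER", "USA", "UK"]

# Every unordered pair, in input order, without (x, x) self-pairs.
pairs = list(combinations(countries, 2))

print(pairs)
```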

Then calculate the total of each commodity across all countries:

row_sums=[sum(map(int,i)) for i in z]
[200, 640, 45, 1400]
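
The totals above come from transposing the rows, so each sum runs down one commodity column rather than along one country row. A short Python 3 sketch of that transposition, using hard-coded sample rows as an assumption in place of the CSV file:

```python
# Rows as csv.reader would return them: a name followed by string values.
rows = [
    ["GER", "200", "300", "45", "0"],
    ["USA", "0", "100", "0", "500"],
    ["UK", "0", "240", "0", "900"],
]

columns = zip(*rows)      # transpose: one tuple per column
names = next(columns)     # the first column holds the country names

# Each remaining column is one commodity; total it across all countries.
commodity_totals = [sum(int(v) for v in col) for col in columns]

print(names, commodity_totals)
```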

Finally, you can loop over your pairs and calculate your expected result:

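# Python 2 code (izip, print statement)
import csv
from itertools import combinations,izip

commodities=['Commodity1' ,'Commodity2', 'Commodity45' ,'Commodity85']
with open('ex.csv', 'rb') as csvfile:
    spamreader = list(csv.reader(csvfile, delimiter=','))
    # country name -> list of integer values, one per commodity
    chart_dict={i[0]:map(int,i[1:]) for i in spamreader}
    # transpose the rows; the first column holds the country names
    z=izip(*spamreader)
    capirs= list(combinations(next(z),2))
    # the remaining columns are commodities; total each across all countries
    row_sums=[sum(map(int,i)) for i in z]

    for i,j in capirs:
        for index,com in enumerate(commodities):
            # the pair's combined value as a share of that commodity's total
            print i,j,com,float(chart_dict[i][index]+chart_dict[j][index])/row_sums[index]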

Result :

GER  USA Commodity1 1.0
 GER  USA Commodity2 0.625
 GER  USA Commodity45 1.0
 GER  USA Commodity85 0.357142857143
 GER  UK Commodity1 1.0
 GER  UK Commodity2 0.84375
 GER  UK Commodity45 1.0
 GER  UK Commodity85 0.642857142857
 USA  UK Commodity1 0.0
 USA  UK Commodity2 0.53125
 USA  UK Commodity45 0.0
 USA  UK Commodity85 1.0
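
The loop above prints one ratio per commodity. If a single score per pair is wanted, as the three stages in the question describe, one possible reading is sketched below in Python 3. The records list, the trade dictionary, and the pair_score helper are hypothetical names introduced for illustration, and "common" is assumed to mean commodities for which both countries have a recorded value.

```python
from itertools import combinations

# Sample (country, commodity code, value) records from the question.
records = [
    ("GER", 1, 200), ("GER", 2, 300), ("GER", 45, 354),
    ("USA", 2, 100), ("USA", 85, 500),
    ("UK", 2, 240), ("UK", 85, 900),
]

# country -> {commodity code: value}; a missing code means no trade.
trade = {}
for country, code, value in records:
    trade.setdefault(country, {})[code] = value

def pair_score(a, b):
    """Trade in commodities common to a and b over their combined total trade."""
    common = trade[a].keys() & trade[b].keys()
    shared = sum(trade[a][c] + trade[b][c] for c in common)
    total = sum(trade[a].values()) + sum(trade[b].values())
    return shared / total if total else 0.0

scores = {(a, b): pair_score(a, b) for a, b in combinations(trade, 2)}
for pair, score in scores.items():
    print(pair, score)
```

For GER and USA this reproduces the arithmetic from the question: (300+100) divided by (200+300+354)+(100+500). Keeping each country's values in a dictionary keyed by commodity code avoids the nested loop over two full rows, which is what caused values to be appended repeatedly.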

Upvotes: 2
