Reputation: 403
I have three dataframes that have column "City". All three dataframes have a different set of city names.
I want to find the percentage of total matches between this column of each dataframe.
For this purpose I used set() and got three sets:
set1 = set(df1['City'])
set2 = set(df2['City'])
set3 = set(df3['City'])
But how should I find the percentage? I used these expressions, but I'm not sure I did everything right:
(len(set1) - len(set2))/len(set1)*100
(len(set1) - len(set3))/len(set1)*100
(len(set2) - len(set3))/len(set2)*100
Is this calculation correct?
Upvotes: 0
Views: 54
Reputation: 17
From the purely mathematical side of things: I assume that you want to find the percentage of cities matching between set1 & set2, set1 & set3, and set2 & set3, respectively.
To calculate this percentage, you need to find the number of matches and the length of the set of cities compared.
Then the percentage can be calculated as follows:
Percentage match 1 & 2 = [(number of matches between 1 & 2)/(length of the set)]*100
For the code side of things: I agree with Sparkofska.
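As a rough illustration of the formula above, here is a minimal sketch that takes "the length of the set" to mean the size of one chosen reference set. That reading is an assumption, and the helper name `match_percentage` and the sample city names are made up for the example.

```python
def match_percentage(reference, other):
    """Percentage of items in `reference` that also appear in `other`."""
    if not reference:
        return 0.0  # avoid dividing by zero for an empty reference set
    matches = reference & other  # set intersection: items present in both
    return len(matches) / len(reference) * 100


# Hypothetical sample data standing in for set(df1['City']) and set(df2['City'])
set1 = {"Paris", "Berlin", "Rome", "Madrid"}
set2 = {"Paris", "Rome", "Oslo"}

print(match_percentage(set1, set2))  # share of set1 found in set2
print(match_percentage(set2, set1))  # share of set2 found in set1
```

Note that with this reading the result depends on which set is used as the reference, so the two calls generally give different percentages.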
Upvotes: 0
Reputation: 1320
You probably want this:
percentage = ( len(set1.intersection(set2)) / len(set1.union(set2)) )*100
which gives you the percentage of common elements in set1 and set2.
This is also known as the Jaccard index, a measure of similarity between sets.
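A small sketch of how this could be applied to all three pairs at once, assuming the sets were built as in the question. The helper name `jaccard_percentage` and the sample cities are hypothetical.

```python
from itertools import combinations


def jaccard_percentage(a, b):
    """Jaccard index of two sets, expressed as a percentage."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: avoid dividing by zero
    return len(a & b) / len(union) * 100


# Hypothetical sample data standing in for set(df['City']) of each dataframe
city_sets = {
    "set1": {"Paris", "Berlin", "Rome", "Madrid"},
    "set2": {"Paris", "Rome", "Oslo"},
    "set3": {"Oslo", "Lima"},
}

# Compare every unordered pair of sets exactly once
results = {}
for (name_a, a), (name_b, b) in combinations(city_sets.items(), 2):
    results[(name_a, name_b)] = jaccard_percentage(a, b)
    print(f"{name_a} vs {name_b}: {results[(name_a, name_b)]:.1f}%")
```

Unlike the length-difference expressions in the question, which compare only the sizes of the sets, this looks at the actual elements: two sets of equal size with no city in common score 0% here.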
Upvotes: 1