Reputation: 25991
Sorry if this is a total noob question, but I want to find similar values in a list. More specifically, I want to see if there is a way I could score the items.
I know that in Python I can just compare two lists with '==' to see if they are the same, but what if they are not exactly the same and instead have somewhat similar values (or not)?
Here's an example:
#Batch one
[1, 10, 20]
[5, 15, 10]
[70, 19, 15]
[50, 40, 20]
#Batch two
[46, 19, 8]
[6, 14, 8]
[2, 11, 44]
Say I want to score/rank the two batches by how similar they are to each other. I thought I could just add up all the numbers in each list and compare the totals, but I don't think that works, because [5, 6, 1000] and [600, 200, 211] would then seem similar. In this example, [5, 15, 10] and [6, 14, 8] should get the highest score.
I thought of dividing each value and looking at the percent difference, but that seems really expensive if the lists get large with many variables (I may eventually have thousands of lists with over 800 variables in each), and I suspect there may be a better approach.
Any suggestions?
Upvotes: 1
Views: 3690
Reputation: 39893
How about using the Euclidean distance?
As a one-liner, using a generator expression:
def distance(lista, listb):
    return sum((b - a) ** 2 for a, b in zip(lista, listb)) ** .5
Or more written out:
def distance(lista, listb):
    runsum = 0.0
    for a, b in zip(lista, listb):
        # square the difference of each pair
        # and add it to the running sum
        runsum += (b - a) ** 2
    # take the square root of the sum
    return runsum ** .5
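A quick check against the pair the question singles out, using the distance function above (lower means more similar):
print(distance([5, 15, 10], [6, 14, 8]))   # ~2.449
print(distance([5, 15, 10], [46, 19, 8]))  # ~41.24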
Upvotes: 3
Reputation: 7807
The obvious solutions are already here. Basically, they correspond to calculating the sum of |x - mean(x)|^p over each set (with p = 2, this is equivalent to calculating the variance).
Since you mentioned percentages: given [1, 2, 3] and [101, 103, 105], which one would you prefer as the more similar set? If the answer is 'the first', then never mind. If it is the second, you would have to normalize the variance by the mean.
The solution is (SquareMean - Mean^2) / Mean^2, where SquareMean = (a^2 + b^2 + c^2)/3 and Mean = (a + b + c)/3.
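A minimal sketch of that formula in Python (the function name normalized_variance is mine; the quantity is the squared coefficient of variation):
def normalized_variance(values):
    n = len(values)
    mean = sum(values) / n
    square_mean = sum(v ** 2 for v in values) / n
    # variance divided by the squared mean
    return (square_mean - mean ** 2) / mean ** 2

print(normalized_variance([1, 2, 3]))        # ~0.1667
print(normalized_variance([101, 103, 105]))  # ~0.000251
By this measure [101, 103, 105] is far more uniform, matching the percentage intuition.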
Upvotes: 1
Reputation: 284582
If I'm understanding you correctly, you're basically wanting to see how tight a cluster you have?
So, if you think of your data as sets of points in 3D, you're trying to find the spread of each cluster?
(In other words you want to compare how internally similar the two batches are?)
In that case, consider something like the following (using numpy to speed things up):
import numpy as np

def spread(group):
    # total spread: the per-column variance, summed across columns
    return group.var(axis=0).sum()

group1 = np.array([[1, 10, 20],
                   [5, 15, 10],
                   [70, 19, 15],
                   [50, 40, 20]], dtype=float)

group2 = np.array([[46, 19, 8],
                   [6, 14, 8],
                   [2, 11, 44]], dtype=float)

print(spread(group1), spread(group2))
So, in this case, group2 (spread ≈ 693.6, vs. ≈ 1011.9 for group1) is the more internally similar.
If, instead, you're interested in finding how "close" the two groups are to each other, then you could compare the distance between their centers:
legs = group1.mean(axis=0) - group2.mean(axis=0)
distance = np.sqrt(np.sum(legs**2))
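(For the two batches above, this center-to-center distance works out to roughly 15.4.)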
Or are you wanting to find the two "points", one from each group, that are closest to each other? (In which case you'd use a distance matrix, or a more efficient algorithm for more points...)
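A sketch of the distance-matrix approach, building on the arrays above (this assumes SciPy is available; cdist computes all pairwise Euclidean distances between the rows of the two arrays):
import numpy as np
from scipy.spatial.distance import cdist

# all pairwise distances between rows of group1 and rows of group2
dists = cdist(group1, group2)

# indices of the closest pair: a row of group1 and a row of group2
i, j = np.unravel_index(dists.argmin(), dists.shape)
print(group1[i], group2[j], dists[i, j])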
Upvotes: 1
Reputation: 23
I don't know exactly how, but I was thinking about trying to use the standard deviation, because similar values would (in theory) have a similar deviation?
In this case [5, 15, 10] gets a standard deviation of 5 and [6, 14, 8] gets about 4.163.
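For what it's worth, a quick check with the standard library (statistics.stdev is the sample standard deviation, dividing by n - 1):
from statistics import stdev

print(stdev([5, 15, 10]))  # 5.0
print(stdev([6, 14, 8]))   # ~4.163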
Upvotes: 0
Reputation: 47609
a = [1, 10, 20]
b = [5, 15, 10]
c = [70, 19, 15]
d = [50, 40, 20]

def sim(seqA, seqB):
    # sum of the absolute element-wise differences
    return sum(abs(a - b) for (a, b) in zip(seqA, seqB))

print(sim(a, a))  # => 0
print(sim(a, b))  # => 19
print(sim(a, c))  # => 83
print(sim(a, d))  # => 79
Lower numbers mean more similar; 0 means identical. (This is the Manhattan, or L1, distance between the two lists.)
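To pick out the most similar cross-batch pair with this score, one possible sketch (batch_one and batch_two are just the question's data, and those names are mine):
from itertools import product

batch_one = [[1, 10, 20], [5, 15, 10], [70, 19, 15], [50, 40, 20]]
batch_two = [[46, 19, 8], [6, 14, 8], [2, 11, 44]]

# score every (batch one, batch two) pairing and keep the best
best = min(product(batch_one, batch_two), key=lambda pair: sim(*pair))
print(best)  # ([5, 15, 10], [6, 14, 8]), with a score of 4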
Upvotes: 3