Reputation: 1303
I have two data sets as lists, for example:
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
The series are market data that may contain tens of thousands of numbers, and both lists always have the same length.
I need to find the "difference" as a percentage, i.e. a single figure that shows how similar or dissimilar the two series are.
My current idea is to build a chart for each list (xa and xb on the Y axis, range(1, len(xa)) on the X axis), interpolate functions for xa and xb, then calculate the areas under xa and xb (by integration) and the area of the difference between xa and xb. The dissimilarity would then be (difference area) * 100% / (xa area + xb area).
I wonder if this problem has a simpler solution. If not, how can I calculate the difference area between xa and xb? The charts are built with scipy, numpy and matplotlib.
Update: I'm looking for ONE number that represents the difference between the series. A percentage is preferred.
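For reference, a rough sketch of this area-based idea, assuming simple linear interpolation between the points and using numpy's trapezoidal rule (numpy.trapz), might look like this:
import numpy as np
xa = np.array([1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49], dtype=float)
xb = np.array([1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9], dtype=float)
x = np.arange(1, len(xa) + 1)  # X axis: 1, 2, ..., len(xa)
# areas under each curve and under the absolute difference (trapezoidal rule)
area_xa = np.trapz(xa, x)
area_xb = np.trapz(xb, x)
area_diff = np.trapz(np.abs(xa - xb), x)
dissimilarity = area_diff * 100.0 / (area_xa + area_xb)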
Upvotes: 0
Views: 3410
Reputation: 70068
Well, if you want a similarity metric for comparing two 1D vectors that preferably returns a value between 0 and 1 (or 0% and 100%), cosine similarity satisfies those criteria (subject to the proviso at the end). Whether it's appropriate given the context of your problem, I don't know, but you know the context, so you can certainly make that determination.
import numpy as NP
import numpy.linalg as LA
# generate some data
fnx = lambda : NP.random.randint(0, 10, 10)
s1, s2 = fnx(), fnx()
# a function to calculate cosine similarity
cx = lambda a, b : round(NP.inner(a, b)/(LA.norm(a)*LA.norm(b)), 2)
cx(s1, s2)
# returns e.g. 0.85 (the exact value depends on the random data above)
If you have many 1D vectors, then one approach might be to measure the cosine similarity of each of these vectors against the median vector.
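A rough sketch of that idea, reusing the fnx and cx helpers above (the number of vectors here is arbitrary):
# stack many 1D vectors into a 2D array, one vector per row
vectors = NP.array([fnx() for _ in range(20)])
# element-wise median across all vectors
median_vec = NP.median(vectors, axis=0)
# cosine similarity of each vector against the median vector
sims = [cx(v, median_vec) for v in vectors]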
In the general case, cosine similarity returns values between -1 and 1, though in many (most?) practical situations in which it is used (for instance, when both vectors are non-negative, as in your data), the possible values are constrained to between 0 and 1.
Also, the formula for cosine similarity is dot(a, b) / (norm(a) * norm(b)); NumPy has a dot function, but for 1D vectors inner computes the same dot product, which is what the snippet above uses.
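Just to illustrate that equivalence for 1D arrays:
a = NP.array([1.0, 2.0, 3.0])
b = NP.array([4.0, 5.0, 6.0])
NP.dot(a, b)    # 32.0
NP.inner(a, b)  # 32.0, same result for 1D inputs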
Upvotes: 5
Reputation: 21055
from __future__ import division
from itertools import izip, repeat
import math

def weighted_mean(values, weights=None):
    total = 0
    number = 0
    if weights is None:
        weights = repeat(1)
    for weight, value in izip(weights, values):
        total += weight * value
        number += weight
    # avoid division by zero if all weights are zero
    return number and total / number

xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]

print "Option 1, if you want bigger numbers to have a bigger effect on the score"
weights = (math.sqrt(abs(a) * abs(b)) for a, b in izip(xa, xb))
# per-pair relative difference; the `and` guards against abs(a) + abs(b) == 0
scores = (abs(a) + abs(b) and abs(a - b) / (abs(a) + abs(b)) for a, b in izip(xa, xb))
final_score = weighted_mean(scores, weights)
print "%.02f%%" % (final_score * 100)

print "Option 2, if you want all numbers to have the same effect on the score"
scores = (abs(a) + abs(b) and abs(a - b) / (abs(a) + abs(b)) for a, b in izip(xa, xb))
final_score = weighted_mean(scores)
print "%.02f%%" % (final_score * 100)
Of course, you can also use other kinds of weights, such as (abs(a) + abs(b)) / 2, depending on how you want to interpret a given difference (see the sketch after the loopless versions below).
Loopless version of the second one:
import numpy

xan = numpy.array(xa)
xbn = numpy.array(xb)
error_threshold = 0.000001  # small constant to avoid division by zero when both values are 0
final_score = numpy.mean((abs(xan - xbn) + error_threshold) / (abs(xan) + abs(xbn) + error_threshold))
Or the first:
scores = (abs(xan - xbn) + error_threshold) / (abs(xan) + abs(xbn) + error_threshold)
weights = numpy.sqrt(abs(xan) * abs(xbn))
final_score = numpy.sum(scores * weights) / numpy.sum(weights)
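And, purely as a sketch, the alternative weighting mentioned above, (abs(a) + abs(b)) / 2, in the same style (reusing scores from the previous snippet):
weights = (abs(xan) + abs(xbn)) / 2.0
final_score = numpy.sum(scores * weights) / numpy.sum(weights)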
Upvotes: 0
Reputation: 15892
Is this what you are looking for?
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
xc = []
for i in range(len(xa)):  # iterate over every index, not len(xa) - 1
    xc.append(xa[i] - xb[i])
print xc
output:
[0, 1, -1, -2, 283, -43, 3, 4381, 0, 40, 4, 40]
EDIT:
Why not take the percent difference of each pair of values and then average them all:
from statlib import stats
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
xc = []
for i in range(len(xa)):
    # percent difference: |a - b| / ((a + b) / 2); assumes a + b != 0
    xc.append(abs(float(xa[i] - xb[i]) / ((xa[i] + xb[i]) / 2.0)))
print stats.mean(xc) * 100
If you don't have statlib, you can get it here.
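If you'd rather not depend on statlib, numpy's mean gives the same average:
import numpy
print numpy.mean(xc) * 100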
Upvotes: 1
Reputation: 80770
It depends a lot on what you are trying to do. For example, you could imagine counting the elements that appear in one list but not the other (the size of the symmetric difference of the two sets); if the numbers correspond to measurements, that would obviously be a very bad metric.
You say time series, so can we assume the order matters?
For time series, it is often beneficial to compute things in the spectral domain, which is something else to consider. Something reduced to a single number is unlikely to give you much information.
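Just to illustrate the spectral-domain idea (comparing magnitude spectra and correlating them is only one possibility, and an assumption on my part):
import numpy as np
xa = np.array([1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49], dtype=float)
xb = np.array([1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9], dtype=float)
# magnitude spectra of the two series
spec_a = np.abs(np.fft.rfft(xa))
spec_b = np.abs(np.fft.rfft(xb))
# correlation between the spectra as a rough similarity figure (in [-1, 1])
similarity = np.corrcoef(spec_a, spec_b)[0, 1]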
Upvotes: 2
Reputation: 40894
This very much depends on the nature of 'similarity' you're seeking.
Two measures spring to my mind.
The mean of sqrt((X[i] - Y[i])^2), or the mean of abs(X[i] - Y[i]), normalized to the range of X and Y, that is, from min(X, Y) to max(X, Y). The closer the result is to 0, the more similar the data sets are. The sqrt version is more sensitive to small differences. (A rough numpy sketch follows below.)
Upvotes: 3
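A rough numpy sketch of those measures; treating "normalize to the range" as dividing the per-element differences by max(X, Y) - min(X, Y), and reading the sqrt variant as the root mean square of the normalized differences, are both assumptions on my part:
import numpy as np
X = np.array(xa, dtype=float)  # xa, xb are the lists from the question
Y = np.array(xb, dtype=float)
# overall range of both series, used to normalize the differences
rng = max(X.max(), Y.max()) - min(X.min(), Y.min())
norm_diff = np.abs(X - Y) / rng
score_abs = np.mean(norm_diff)                 # mean absolute measure
score_sqrt = np.sqrt(np.mean(norm_diff ** 2))  # sqrt (root-mean-square) variant
# both lie between 0 and 1; multiply by 100 for a percentage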