Reputation: 1303
I have two data sets as lists, for example:
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
The series are market data that may contain tens of thousands of numbers, and both lists always have the same length.
I need to find the "difference" as a percentage, i.e. a single figure that shows how similar or dissimilar the two series are.
My current idea is to build a chart for each list (xa and xb on the Y axis, range(1, len(xa)) on the X axis), interpolate functions for xa and xb, then calculate the areas under xa and xb (by integration) and the area of the difference between xa and xb. The dissimilarity would then be (difference area) * 100% / (xa area + xb area).
I wonder if this problem has a simpler solution. If not, how can I calculate the difference area between xa and xb? The charts are built with scipy, numpy and matplotlib.
Update: I'm looking for ONE number that represents the difference between the series. A percentage is preferred.
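For reference, a rough sketch of this area-based idea, assuming simple linear interpolation between the points and using numpy's trapezoidal rule (numpy.trapz), might look like this:
import numpy as np
xa = np.array([1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49], dtype=float)
xb = np.array([1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9], dtype=float)
x = np.arange(1, len(xa) + 1)  # X axis: 1, 2, ..., len(xa)
# areas under each curve and under the absolute difference (trapezoidal rule)
area_xa = np.trapz(xa, x)
area_xb = np.trapz(xb, x)
area_diff = np.trapz(np.abs(xa - xb), x)
dissimilarity = area_diff * 100.0 / (area_xa + area_xb)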
Upvotes: 0
Views: 3410
Reputation: 70068
Well, if you want a similarity metric for comparing two 1D vectors that preferably returns a value between 0 and 1 (or 0% and 100%), cosine similarity satisfies those criteria (subject to the proviso at the end). Whether it's appropriate given the context of your problem, I don't know, but you know the context, so you can certainly make that determination.
import numpy as NP
import numpy.linalg as LA
# generate some data
fnx = lambda : NP.random.randint(0, 10, 10)
s1, s2 = fnx(), fnx()
# a function to calculate cosine similarity
cx = lambda a, b : round(NP.inner(a, b)/(LA.norm(a)*LA.norm(b)), 2)
cx(s1, s2)
# returns e.g. 0.85 (the exact value depends on the random data above)
If you have many 1D vectors, then one approach might be to measure the cosine similarity of each of these vectors against the median vector.
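A rough sketch of that idea, reusing the fnx and cx helpers above (the number of vectors here is arbitrary):
# stack many 1D vectors into a 2D array, one vector per row
vectors = NP.array([fnx() for _ in range(20)])
# element-wise median across all vectors
median_vec = NP.median(vectors, axis=0)
# cosine similarity of each vector against the median vector
sims = [cx(v, median_vec) for v in vectors]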
In the general case, cosine similarity returns values between -1 and 1, though in many (most?) practical situations in which it is used (for instance, when both vectors are non-negative, as in your data), the possible values are constrained to between 0 and 1.
Also, the formula for cosine similarity is dot(a, b) / (norm(a) * norm(b)); NumPy has a dot function, but for 1D vectors inner computes the same dot product, which is what the snippet above uses.
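Just to illustrate that equivalence for 1D arrays:
a = NP.array([1.0, 2.0, 3.0])
b = NP.array([4.0, 5.0, 6.0])
NP.dot(a, b)    # 32.0
NP.inner(a, b)  # 32.0, same result for 1D inputs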
Upvotes: 5
Reputation: 21055
from __future__ import division
from itertools import izip, repeat
import math

def weighted_mean(values, weights=None):
    total = 0
    number = 0
    if weights is None:
        weights = repeat(1)
    for weight, value in izip(weights, values):
        total += weight * value
        number += weight
    # avoid division by zero if all weights are zero
    return number and total / number

xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]

print "Option 1, if you want bigger numbers to have a bigger effect on the score"
weights = (math.sqrt(abs(a) * abs(b)) for a, b in izip(xa, xb))
# per-pair relative difference; the `and` guards against abs(a) + abs(b) == 0
scores = (abs(a) + abs(b) and abs(a - b) / (abs(a) + abs(b)) for a, b in izip(xa, xb))
final_score = weighted_mean(scores, weights)
print "%.02f%%" % (final_score * 100)

print "Option 2, if you want all numbers to have the same effect on the score"
scores = (abs(a) + abs(b) and abs(a - b) / (abs(a) + abs(b)) for a, b in izip(xa, xb))
final_score = weighted_mean(scores)
print "%.02f%%" % (final_score * 100)
Of course, you can also use other kinds of weights, such as (abs(a) + abs(b)) / 2, depending on how you want to interpret a given difference (see the sketch after the loopless versions below).
Loopless version of the second one:
import numpy

xan = numpy.array(xa)
xbn = numpy.array(xb)
error_threshold = 0.000001  # small constant to avoid division by zero when both values are 0
final_score = numpy.mean((abs(xan - xbn) + error_threshold) / (abs(xan) + abs(xbn) + error_threshold))
Or the first:
scores = (abs(xan - xbn) + error_threshold) / (abs(xan) + abs(xbn) + error_threshold)
weights = numpy.sqrt(abs(xan) * abs(xbn))
final_score = numpy.sum(scores * weights) / numpy.sum(weights)
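And, purely as a sketch, the alternative weighting mentioned above, (abs(a) + abs(b)) / 2, in the same style (reusing scores from the previous snippet):
weights = (abs(xan) + abs(xbn)) / 2.0
final_score = numpy.sum(scores * weights) / numpy.sum(weights)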
Upvotes: 0
Reputation: 15892
Is this what you are looking for?
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
xc = []
for i in range(len(xa)):  # iterate over every index, not len(xa) - 1
    xc.append(xa[i] - xb[i])
print xc
output:
[0, 1, -1, -2, 283, -43, 3, 4381, 0, 40, 4, 40]
EDIT:
Why not take the percent difference of each pair of values and then average them all:
from statlib import stats
xa = [1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49]
xb = [1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9]
xc = []
for i in range(len(xa)):
    # percent difference: |a - b| / ((a + b) / 2); assumes a + b != 0
    xc.append(abs(float(xa[i] - xb[i]) / ((xa[i] + xb[i]) / 2.0)))
print stats.mean(xc) * 100
If you don't have statlib, you can get it here.
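If you'd rather not depend on statlib, numpy's mean gives the same average:
import numpy
print numpy.mean(xc) * 100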
Upvotes: 1
Reputation: 80770
It depends a lot on what you are trying to do. For example, you could imagine counting the elements that appear in one list but not the other (the size of the symmetric difference of the two sets); if the numbers correspond to measurements, that would obviously be a very bad metric.
You say time series, so can we assume the order matters?
For time series, it is often beneficial to compute things in the spectral domain, which is something else to consider. Something reduced to a single number is unlikely to give you much information.
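Just to illustrate the spectral-domain idea (comparing magnitude spectra and correlating them is only one possibility, and an assumption on my part):
import numpy as np
xa = np.array([1, 2, 3, 10, 1383, 0, 12, 9229, 2, 494, 10, 49], dtype=float)
xb = np.array([1, 1, 4, 12, 1100, 43, 9, 4848, 2, 454, 6, 9], dtype=float)
# magnitude spectra of the two series
spec_a = np.abs(np.fft.rfft(xa))
spec_b = np.abs(np.fft.rfft(xb))
# correlation between the spectra as a rough similarity figure (in [-1, 1])
similarity = np.corrcoef(spec_a, spec_b)[0, 1]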
Upvotes: 2
Reputation: 40894
This very much depends on the nature of 'similarity' you're seeking.
Two measures spring to my mind.
The mean of sqrt((X[i] - Y[i])^2), or the mean of abs(X[i] - Y[i]), normalized to the range of X and Y, that is, from min(X, Y) to max(X, Y). The closer the result is to 0, the more similar the data sets are. The sqrt version is more sensitive to small differences. (A rough numpy sketch follows below.)
Upvotes: 3
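A rough numpy sketch of those measures; treating "normalize to the range" as dividing the per-element differences by max(X, Y) - min(X, Y), and reading the sqrt variant as the root mean square of the normalized differences, are both assumptions on my part:
import numpy as np
X = np.array(xa, dtype=float)  # xa, xb are the lists from the question
Y = np.array(xb, dtype=float)
# overall range of both series, used to normalize the differences
rng = max(X.max(), Y.max()) - min(X.min(), Y.min())
norm_diff = np.abs(X - Y) / rng
score_abs = np.mean(norm_diff)                 # mean absolute measure
score_sqrt = np.sqrt(np.mean(norm_diff ** 2))  # sqrt (root-mean-square) variant
# both lie between 0 and 1; multiply by 100 for a percentage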