Reputation: 1

Similarity percentage between two datasets

How to find the similarity (not correlation) between two datasets?

I am having trouble measuring the similarity between matching datasets. I have one main dataset, and I want to test multiple datasets of the same length and time range against it to find the closest match in terms of sequence, day-to-day similarities and differences, closest values on the same day, etc.

I know Pearson's r is the wrong measure here: if, for instance, every value in x2 is the corresponding value in x1 incremented by 1, I get a correlation of 1, even though the series are not a perfect match because the data points aren't the same. That is why I know I am not looking for correlation. (Each dataset is independent.)
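A minimal illustration of this point. The series `b` is hypothetical (a copy of `a` shifted by 1) and is not part of the data being compared:

```r
# a holds some daily values; b is a hypothetical copy shifted by a constant of 1
a <- c(8, 7, 6, 6, 7, 5, 5)
b <- a + 1

# Pearson correlation ignores a constant offset, so it reports a perfect match
r <- cor(a, b, method = "pearson")

# Yet no pair of values is actually equal
n_equal <- sum(a == b)

print(r)
print(n_equal)
```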

Here is a sample of the two time series columns I am trying to find the percentage of similarity for.

     Day   x1 x2 
      1     8  7
      2     7  7
      3     6  6
      4     6  5
      5     7  6
      6     5  6
      7     5  5

How do I calculate their similarity on various attributes, such as the difference between x1 and x2 on each day (day 1: 8 - 7), as well as the day-to-day change within each series (x1, days 1 to 2: 8 - 7; x2, days 1 to 2: 7 - 7)?

Overall, I want the similarity score to be based on their sequences and values while taking the time ordering into consideration, so that I can decide whether these columns are similar enough to be a match or not.

Upvotes: 0

Answers (2)

AkselA

Reputation: 8836

As mentioned in the comments, you really need to give some serious thought to what you mean by 'similarity' and what the similarity is between. Is it between sets, vectors or points in n-space? Is the space Euclidean? Should the triangle inequality hold?

For reading, Metrics could be a good place to start, or for a slightly different angle, something on the Jaccard and similar indices. Alternatively you can think of the problem as comparing the similarities between words, in which case you'd be considering the edit distance.
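As a rough sketch of the edit-distance angle, the two series from the question can be collapsed into "words" and compared with `adist()` from base R. This assumes every value is a single digit, so that one character corresponds to one day; it would not carry over directly to multi-digit or non-integer data.

```r
# Series from the question
x1 <- c(8, 7, 6, 6, 7, 5, 5)
x2 <- c(7, 7, 6, 5, 6, 6, 5)

# Assumption: every value is a single digit, so each series can be
# collapsed into a "word" with one character per day
w1 <- paste(x1, collapse = "")
w2 <- paste(x2, collapse = "")

# Generalized Levenshtein (edit) distance between the two words
edit_dist <- adist(w1, w2)[1, 1]
print(edit_dist)
```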

In R, a distance matrix can be computed with dist(). I took the liberty of expanding your matrix with a few columns.

m <- as.matrix(read.table(text="
  x1 x2 x3 x4 x5 x6
   8  9  8  8  7  5
   7  8  8  8  8  6
   6  7  7  8  9  8
   6  7  6  5  4  4
   7  8  8  9  8  7
   5  6  7  6  5  6
   5  6  6  5  5  4", header=TRUE))

dist() compares between rows, so the original matrix has to be transposed.

m.dist <- as.matrix(dist(t(m), method="euclidean"))

If you're only interested in the distances between adjacent columns (x1 to x2, x2 to x3, and so on), the relevant diagonal can be extracted like this:

m.dist[row(m.dist) == col(m.dist)+1]
# 2.645751 1.732051 2.236068 2.236068 3.464102
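Since the question compares several candidates against one main series, a possible variation is to read off the row of the distance matrix that belongs to the main column instead of the diagonal. The sketch below assumes x1 is the main dataset; the names `d.main` and `closest` are only illustrative.

```r
m <- as.matrix(read.table(text = "
  x1 x2 x3 x4 x5 x6
   8  9  8  8  7  5
   7  8  8  8  8  6
   6  7  7  8  9  8
   6  7  6  5  4  4
   7  8  8  9  8  7
   5  6  7  6  5  6
   5  6  6  5  5  4", header = TRUE))

# dist() compares rows, so transpose to compare columns
m.dist <- as.matrix(dist(t(m), method = "euclidean"))

# Assumption: x1 is the main series; keep its distances to all other columns
d.main <- m.dist["x1", colnames(m.dist) != "x1"]

# Candidates ordered from most to least similar, and the single closest one
print(sort(d.main))
closest <- names(which.min(d.main))
print(closest)
```

A smaller distance means a closer match, so the first entry of the sorted vector is the best candidate under this particular metric.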

While dist() gives a good selection of distance methods, many other dissimilarity methods can be found in vegdist() in the package vegan. For example:

library(vegan)
m.diss <- as.matrix(vegdist(t(m), method="jaccard"))
m.diss[row(m.diss) == col(m.diss)+1]
# 0.13725490 0.05769231 0.09615385 0.10000000 0.17021277

Upvotes: 1

amonk

Reputation: 1795

Let's say that the data are in the form of

library(data.table)
# x1 and x2 are random permutations of 1:7, so the values differ between runs
dt <- data.table(Day = 1:7, x1 = sample(7), x2 = sample(7))
dt
   Day x1 x2
1:   1  5  4
2:   2  7  5
3:   3  4  7
4:   4  1  1
5:   5  3  2
6:   6  2  6
7:   7  6  3

Then:

dt[,.(std=sd(c(x1,x2))),by=1:nrow(dt)]

   nrow       std
1:    1 0.7071068
2:    2 1.4142136
3:    3 2.1213203
4:    4 0.0000000
5:    5 0.7071068
6:    6 2.8284271
7:    7 2.1213203

This calculates the standard deviation of x1 and x2 for each day. If a similarity function is given, it can be applied per pair of values in the same way.
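As a sketch of what such a per-pair function might look like for the data in the question, the base R snippet below computes the per-day difference and the disagreement in day-to-day changes. The percentage at the end is a made-up scoring rule (mean absolute difference scaled by the pooled value range), shown only as one assumption among many possible definitions.

```r
# Series from the question
x1 <- c(8, 7, 6, 6, 7, 5, 5)
x2 <- c(7, 7, 6, 5, 6, 6, 5)

# Absolute difference between the two series on each day
day_diff <- abs(x1 - x2)

# Day-to-day change within each series, and how far those changes disagree
step_diff <- abs(diff(x1) - diff(x2))

# Hypothetical score: 100% means identical values on every day;
# the scaling by the pooled range is an assumption, not a standard measure
value_range <- diff(range(c(x1, x2)))
similarity_pct <- 100 * (1 - mean(day_diff) / value_range)

print(day_diff)
print(step_diff)
print(similarity_pct)
```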

Upvotes: 0
