DPatrick

Reputation: 431

R: Is there a way to calculate the cosine & Euclidean distance between 2 time series with single & multiple variables of interest?

Let's say I have time series data of City A, City B, City C & City D that looks like this:

+------------+--------+--------+--------+--------+
| Dates      | City A | City B | City C | City D |
+------------+--------+--------+--------+--------+
| 2020-01-01 | 10     | 20     | 20     | 30     |
+------------+--------+--------+--------+--------+
| 2020-01-02 | 20     | 30     | 30     | 40     |
+------------+--------+--------+--------+--------+
| 2020-01-03 | 30     | 40     | 20     | 20     |
+------------+--------+--------+--------+--------+
| 2020-01-04 | 40     | 20     | 15     | 40     |
+------------+--------+--------+--------+--------+
| 2020-01-05 | 50     | 40     | 18     | 10     |
+------------+--------+--------+--------+--------+
| 2020-01-06 | 60     | 50     | 20     | 15     |
+------------+--------+--------+--------+--------+
| 2020-01-07 | 70     | 60     | 40     | 72     |
+------------+--------+--------+--------+--------+
| 2020-01-08 | 50     | 80     | 60     | 90     |
+------------+--------+--------+--------+--------+
| 2020-01-09 | 30     | 30     | 90     | 17     |
+------------+--------+--------+--------+--------+
| 2020-01-10 | 60     | 50     | 18     | 15     |
+------------+--------+--------+--------+--------+

I would like to calculate the cosine & Euclidean distance between A & B, A & C, and A & D, respectively, by aligning the time index.

For example, to calculate the Euclidean distance between City A & City B, I would calculate the Euclidean distance for their 2020-01-01 data, 2020-01-02 data, 2020-01-03 data ... and then add all of those together to get the final Euclidean distance between City A & City B.

What is an elegant way to write an R function that performs this task?

Then, if my data starts to include more variables:

+------------+-------+------+------+------+
| Dates      | City  | Var1 | Var2 | Var3 |
+------------+-------+------+------+------+
| 2020-01-01 | A     | 20   | 200  | 5    |
+------------+-------+------+------+------+
| 2020-01-02 | A     | 30   | 300  | 3    |
+------------+-------+------+------+------+
| 2020-01-03 | A     | 40   | 220  | 4    |
+------------+-------+------+------+------+
| 2020-01-04 | A     | 20   | 150  | 2    |
+------------+-------+------+------+------+
| 2020-01-05 | A     | 40   | 180  | 5    |
+------------+-------+------+------+------+
| 2020-01-01 | B     | 50   | 200  | 6    |
+------------+-------+------+------+------+
| 2020-01-02 | B     | 60   | 400  | 7    |
+------------+-------+------+------+------+
| 2020-01-03 | B     | 80   | 600  | 8    |
+------------+-------+------+------+------+
| 2020-01-04 | B     | 30   | 900  | 4    |
+------------+-------+------+------+------+
| 2020-01-05 | B     | 50   | 180  | 2    |
+------------+-------+------+------+------+
| 2020-01-01 | C     | 20   | 230  | 3    |
+------------+-------+------+------+------+
| 2020-01-02 | C     | 30   | 340  | 5    |
+------------+-------+------+------+------+
| 2020-01-03 | C     | 40   | 230  | 3    |
+------------+-------+------+------+------+
| 2020-01-04 | C     | 20   | 120  | 5    |
+------------+-------+------+------+------+
| 2020-01-05 | C     | 40   | 120  | 4    |
+------------+-------+------+------+------+
| 2020-01-01 | D     | 20   | 400  | 5    |
+------------+-------+------+------+------+
| 2020-01-02 | D     | 30   | 500  | 6    |
+------------+-------+------+------+------+
| 2020-01-03 | D     | 10   | 600  | 7    |
+------------+-------+------+------+------+
| 2020-01-04 | D     | 50   | 300  | 7    |
+------------+-------+------+------+------+
| 2020-01-05 | D     | 20   | 300  | 4    |
+------------+-------+------+------+------+

Using the same example as above, to calculate the Euclidean distance between City A & City B, I would calculate the Euclidean distance for their 2020-01-01 data, 2020-01-02 data, 2020-01-03 data ... for Variable 1, then repeat this process for Variable 2 & Variable 3. Finally, I would add all of those together to get the total Euclidean distance between City A & City B.

I am wondering if such a distance calculation is technically feasible and, if so, how I would write an R function that performs these tasks for Euclidean & cosine distance, for a single variable of interest & for multiple variables of interest, respectively.

Much appreciation for your help!

Upvotes: 0

Views: 465

Answers (1)

DaveArmstrong

Reputation: 21947

I edited the post to include the cosine distance. First, let's make the first data set above.

dat <- tibble::tribble(~Dates, ~`City A`, ~`City B`,  ~`City C`, ~`City D`,
                       "2020-01-01" ,  10     ,  20     , 20     , 30, 
                       "2020-01-02" ,  20     ,  30     , 30     , 40, 
                       "2020-01-03" ,  30     ,  40     , 20     , 20, 
                       "2020-01-04" ,  40     ,  20     , 15     , 40, 
                       "2020-01-05" ,  50     ,  40     , 18     , 10, 
                       "2020-01-06" ,  60     ,  50     , 20     , 15, 
                       "2020-01-07" ,  70     ,  60     , 40     , 72, 
                       "2020-01-08" ,  50     ,  80     , 60     , 90, 
                       "2020-01-09" ,  30     ,  30     , 90     , 17, 
                       "2020-01-10" ,  60     ,  50     , 18     , 15) 

dat$Dates <- lubridate::ymd(dat$Dates)

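Then, we can transpose the data so that each city is a row of the matrix X, and define the function that computes the distance. Because we're going to use it with outer(), it takes two arguments, x and y, which are the indices of the two rows of X to compare.

library(dplyr)  # provides %>% and select()

X <- dat %>% select(-Dates) %>% as.matrix %>% t

# sqrt(d^2) is |d|, so this sums the per-date absolute differences,
# i.e. the per-date distances added together as the question describes
edfun <- function(x,y){
  sum(sqrt((X[x, ] - X[y,])^2))
}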

Now, we can calculate the distances and print them:

o1 <- outer(1:nrow(X), 1:nrow(X), Vectorize(edfun))
rownames(o1) <- colnames(o1) <- rownames(X)
o1
#        City A City B City C City D
# City A      0    120    269    235
# City B    120      0    209    195
# City C    269    209      0    196
# City D    235    195    196      0
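
As a cross-check: squaring and then taking the square root element by element, as edfun does, gives the sum of absolute differences, which matches the per-date sum described in the question. If the usual Euclidean distance over the whole series is wanted instead, the square root goes outside the sum. Here is a minimal sketch using base R's dist(), which computes distances between the rows of a matrix. X is rebuilt by hand only so that the snippet is self-contained, and the names `abs_sum` and `euclid` are just illustrative.

```r
# Rebuild the city-by-date matrix from the first table (one row per city)
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Sum of per-date absolute differences (same numbers as o1)
abs_sum <- as.matrix(dist(X, method = "manhattan"))

# Conventional Euclidean distance: sqrt of the summed squared differences
euclid <- as.matrix(dist(X, method = "euclidean"))

abs_sum["City A", "City B"]  # 120
euclid["City A", "City B"]   # sqrt(2000), about 44.72
```

Which of the two is appropriate depends on whether you really want the per-date distances added up or a single distance over the whole series.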

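Now, we can make a cosine function and compute those values. Note that cdfun returns the cosine similarity (1 on the diagonal, where a city is compared with itself); if you need a cosine distance, take 1 minus this value.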

cdfun <- function(x,y){
  num <- sum(X[x,]*X[y, ])
  d1 <- sqrt(sum(X[x, ]^2))
  d2 <- sqrt(sum(X[y, ]^2))
  num/(d1*d2)
}
o1a <- outer(1:nrow(X), 1:nrow(X), Vectorize(cdfun))
rownames(o1a) <- colnames(o1a) <- rownames(X)
o1a
#           City A    City B    City C    City D
# City A 1.0000000 0.9521640 0.7400186 0.7913705
# City B 0.9521640 1.0000000 0.8109673 0.8805258
# City C 0.7400186 0.8109673 1.0000000 0.7674460
# City D 0.7913705 0.8805258 0.7674460 1.0000000
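
The same matrix can also be obtained without outer() and Vectorize() by using matrix algebra, which makes the step from similarity to distance explicit. This is only a sketch in base R; X is rebuilt by hand so the snippet runs on its own, and `cos_sim` and `cos_dist` are illustrative names.

```r
# City-by-date matrix from the first table (one row per city)
X <- rbind(
  "City A" = c(10, 20, 30, 40, 50, 60, 70, 50, 30, 60),
  "City B" = c(20, 30, 40, 20, 40, 50, 60, 80, 30, 50),
  "City C" = c(20, 30, 20, 15, 18, 20, 40, 60, 90, 18),
  "City D" = c(30, 40, 20, 40, 10, 15, 72, 90, 17, 15)
)

# Length (Euclidean norm) of each city's series
norms <- sqrt(rowSums(X^2))

# tcrossprod(X) is X %*% t(X): all pairwise dot products at once
cos_sim <- tcrossprod(X) / outer(norms, norms)

# Cosine distance is one minus the similarity, so the diagonal becomes 0
cos_dist <- 1 - cos_sim

round(cos_sim["City A", "City B"], 4)  # 0.9522
```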

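We can do those same things for the longer data. Here, pivot_wider() (from tidyr) spreads every variable-date combination into its own column, so that each city again becomes one row:
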
dat2 <- tibble::tribble( ~Dates    ,  ~City ,  ~Var1,  ~Var2,  ~Var3, 
                         "2020-01-01" , "A"     , 20   , 200  , 5    ,
                         "2020-01-02" , "A"     , 30   , 300  , 3    ,
                         "2020-01-03" , "A"     , 40   , 220  , 4    ,
                         "2020-01-04" , "A"     , 20   , 150  , 2    ,
                         "2020-01-05" , "A"     , 40   , 180  , 5    ,
                         "2020-01-01" , "B"     , 50   , 200  , 6    ,
                         "2020-01-02" , "B"     , 60   , 400  , 7    ,
                         "2020-01-03" , "B"     , 80   , 600  , 8    ,
                         "2020-01-04" , "B"     , 30   , 900  , 4    ,
                         "2020-01-05" , "B"     , 50   , 180  , 2    ,
                         "2020-01-01" , "C"     , 20   , 230  , 3    ,
                         "2020-01-02" , "C"     , 30   , 340  , 5    ,
                         "2020-01-03" , "C"     , 40   , 230  , 3    ,
                         "2020-01-04" , "C"     , 20   , 120  , 5    ,
                         "2020-01-05" , "C"     , 40   , 120  , 4    ,
                         "2020-01-01" , "D"     , 20   , 400  , 5    ,
                         "2020-01-02" , "D"     , 30   , 500  , 6    ,
                         "2020-01-03" , "D"     , 10   , 600  , 7    ,
                         "2020-01-04" , "D"     , 50   , 300  , 7    ,
                         "2020-01-05" , "D"     , 20   , 300  , 4  )


library(tidyr)  # provides pivot_wider()

d2w <- dat2 %>% 
  pivot_wider(names_from="Dates", 
              values_from=c("Var1", "Var2", "Var3"))

X2 <- d2w %>% select(-City) %>% as.matrix
rownames(X2) <- paste0("City ", d2w$City)
edfun2 <- function(x,y){
  sum(sqrt((X2[x, ] - X2[y,])^2))
}

o2 <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(edfun2))
rownames(o2) <- colnames(o2) <- rownames(X2)
o2
#        City A City B City C City D
# City A      0   1364    179   1142
# City B   1364      0   1433   1208
# City C    179   1433      0   1149
# City D   1142   1208   1149      0
cdfun2 <- function(x,y){
  num <- sum(X2[x,]*X2[y, ])
  d1 <- sqrt(sum(X2[x, ]^2))
  d2 <- sqrt(sum(X2[y, ]^2))
  num/(d1*d2)
}

o2a <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(cdfun2))
rownames(o2a) <- colnames(o2a) <- rownames(X2)
o2a
#           City A    City B    City C    City D
# City A 1.0000000 0.8051685 0.9861522 0.9742238
# City B 0.8051685 1.0000000 0.7617596 0.8338688
# City C 0.9861522 0.7617596 1.0000000 0.9637144
# City D 0.9742238 0.8338688 0.9637144 1.0000000
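
If you would rather have one reusable function than four that depend on the global X and X2, the pieces above could be wrapped up roughly as follows. This is a sketch under assumptions: `pairwise_measure` and its `type` labels are hypothetical names, and it expects a numeric matrix with one row per city. The example rows are the City A and City B values from the second table, laid out as Var1, Var2, Var3 over the five dates.

```r
# Hypothetical helper: M is a numeric matrix with one row per series
pairwise_measure <- function(M, type = c("abs_sum", "euclidean", "cosine_similarity")) {
  type <- match.arg(type)
  if (type == "cosine_similarity") {
    norms <- sqrt(rowSums(M^2))
    return(tcrossprod(M) / outer(norms, norms))
  }
  # dist() works row-wise; "manhattan" is the sum of absolute differences
  method <- if (type == "abs_sum") "manhattan" else "euclidean"
  as.matrix(dist(M, method = method))
}

# Cities A and B from the second table: Var1, Var2, Var3 for five dates each
M <- rbind(
  "City A" = c(20, 30, 40, 20, 40,  200, 300, 220, 150, 180,  5, 3, 4, 2, 5),
  "City B" = c(50, 60, 80, 30, 50,  200, 400, 600, 900, 180,  6, 7, 8, 4, 2)
)

pairwise_measure(M, "abs_sum")["City A", "City B"]  # 1364
res_cos <- pairwise_measure(M, "cosine_similarity")
```

Note that Var2 accounts for 1230 of the 1364 here because it is on a much larger scale than Var1 and Var3, so you may want to rescale the variables before combining them.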

Upvotes: 1
