Reputation: 431
Let's say I have time series data of City A, City B, City C & City D that looks like this:
+------------+--------+--------+--------+--------+
| Dates      | City A | City B | City C | City D |
+------------+--------+--------+--------+--------+
| 2020-01-01 |     10 |     20 |     20 |     30 |
| 2020-01-02 |     20 |     30 |     30 |     40 |
| 2020-01-03 |     30 |     40 |     20 |     20 |
| 2020-01-04 |     40 |     20 |     15 |     40 |
| 2020-01-05 |     50 |     40 |     18 |     10 |
| 2020-01-06 |     60 |     50 |     20 |     15 |
| 2020-01-07 |     70 |     60 |     40 |     72 |
| 2020-01-08 |     50 |     80 |     60 |     90 |
| 2020-01-09 |     30 |     30 |     90 |     17 |
| 2020-01-10 |     60 |     50 |     18 |     15 |
+------------+--------+--------+--------+--------+
I would like to calculate the cosine and Euclidean distances between A & B, A & C, and A & D, respectively, by aligning the time index.
For example, to calculate the Euclidean distance between City A & City B, I would calculate the Euclidean distance between their 2020-01-01 values, their 2020-01-02 values, their 2020-01-03 values, and so on, and then add all of those together to get the final Euclidean distance between City A & City B.
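(For instance, using the table above, the first three dates would contribute |10 - 20| + |20 - 30| + |30 - 40| = 10 + 10 + 10 = 30 to the total for A & B.)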
What is an elegant way to write an R function that performs this task?
Then, if my data starts to include more variables:
+------------+------+------+------+------+
| Dates      | City | Var1 | Var2 | Var3 |
+------------+------+------+------+------+
| 2020-01-01 | A    |   20 |  200 |    5 |
| 2020-01-02 | A    |   30 |  300 |    3 |
| 2020-01-03 | A    |   40 |  220 |    4 |
| 2020-01-04 | A    |   20 |  150 |    2 |
| 2020-01-05 | A    |   40 |  180 |    5 |
| 2020-01-01 | B    |   50 |  200 |    6 |
| 2020-01-02 | B    |   60 |  400 |    7 |
| 2020-01-03 | B    |   80 |  600 |    8 |
| 2020-01-04 | B    |   30 |  900 |    4 |
| 2020-01-05 | B    |   50 |  180 |    2 |
| 2020-01-01 | C    |   20 |  230 |    3 |
| 2020-01-02 | C    |   30 |  340 |    5 |
| 2020-01-03 | C    |   40 |  230 |    3 |
| 2020-01-04 | C    |   20 |  120 |    5 |
| 2020-01-05 | C    |   40 |  120 |    4 |
| 2020-01-01 | D    |   20 |  400 |    5 |
| 2020-01-02 | D    |   30 |  500 |    6 |
| 2020-01-03 | D    |   10 |  600 |    7 |
| 2020-01-04 | D    |   50 |  300 |    7 |
| 2020-01-05 | D    |   20 |  300 |    4 |
+------------+------+------+------+------+
Using the same example above, to calculate the Euclidean distance between City A & City B, I would calculate the Euclidean distance between their 2020-01-01, 2020-01-02, 2020-01-03, ... values for Var1, repeat this process for Var2 and Var3, and then add all of those together to get the total Euclidean distance between City A & City B.
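(For instance, for Var1 the 2020-01-01 contribution would be |20 - 50| = 30 and the 2020-01-02 contribution |30 - 60| = 30; the same is then done for Var2 and Var3, and the three per-variable totals are summed.)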
I am wondering if such a distance calculation is technically feasible and, if so, how to write an R function that performs these tasks for Euclidean and cosine distance, both for a single variable of interest and for multiple variables of interest.
Much appreciation for your help!
Upvotes: 0
Views: 465
Reputation: 21947
I edited the post to include the cosine distance. First, let's make the first data set above.
dat <- tibble::tribble(~Dates, ~`City A`, ~`City B`, ~`City C`, ~`City D`,
"2020-01-01" , 10 , 20 , 20 , 30,
"2020-01-02" , 20 , 30 , 30 , 40,
"2020-01-03" , 30 , 40 , 20 , 20,
"2020-01-04" , 40 , 20 , 15 , 40,
"2020-01-05" , 50 , 40 , 18 , 10,
"2020-01-06" , 60 , 50 , 20 , 15,
"2020-01-07" , 70 , 60 , 40 , 72,
"2020-01-08" , 50 , 80 , 60 , 90,
"2020-01-09" , 30 , 30 , 90 , 17,
"2020-01-10" , 60 , 50 , 18 , 15)
dat$Dates <- lubridate::ymd(dat$Dates)
Then, we can rearrange the data so that each city is a row of a matrix and define the function that computes the distance. Because we're going to use it with outer(), it takes two indices that pick out the two rows of the X matrix being compared.
library(dplyr)

# cities in rows, dates in columns
X <- dat %>% select(-Dates) %>% as.matrix() %>% t()

# sum over dates of the per-date (one-dimensional) Euclidean distances,
# i.e. the sum of absolute differences
edfun <- function(x, y){
  sum(sqrt((X[x, ] - X[y, ])^2))
}
Now, we can calculate the distances and print them:
# apply edfun to every pair of cities (rows of X)
o1 <- outer(1:nrow(X), 1:nrow(X), Vectorize(edfun))
rownames(o1) <- colnames(o1) <- rownames(X)
o1
# City A City B City C City D
# City A 0 120 269 235
# City B 120 0 209 195
# City C 269 209 0 196
# City D 235 195 196 0
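If you want to wrap all of that into a single reusable function, as the question asks, one possible sketch is below; the name pair_dist and its arguments are just my own choices, not an established API. It takes the wide data frame and a distance function for two vectors and returns the labelled matrix.
pair_dist <- function(dat, distfun){
  X <- t(as.matrix(dat[, -1]))                # drop the Dates column, cities in rows
  f <- function(x, y) distfun(X[x, ], X[y, ])
  o <- outer(seq_len(nrow(X)), seq_len(nrow(X)), Vectorize(f))
  rownames(o) <- colnames(o) <- rownames(X)
  o
}
pair_dist(dat, function(a, b) sum(sqrt((a - b)^2)))  # reproduces o1 above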
Now, we can make a cosine function and estimate those values. Note that this returns the cosine similarity (the diagonal is 1); the cosine distance is usually defined as one minus this quantity.
cdfun <- function(x, y){
  num <- sum(X[x, ] * X[y, ])   # dot product
  d1 <- sqrt(sum(X[x, ]^2))     # norm of city x
  d2 <- sqrt(sum(X[y, ]^2))     # norm of city y
  num / (d1 * d2)               # cosine similarity
}
o1a <- outer(1:nrow(X), 1:nrow(X), Vectorize(cdfun))
rownames(o1a) <- colnames(o1a) <- rownames(X)
o1a
# City A City B City C City D
# City A 1.0000000 0.9521640 0.7400186 0.7913705
# City B 0.9521640 1.0000000 0.8109673 0.8805258
# City C 0.7400186 0.8109673 1.0000000 0.7674460
# City D 0.7913705 0.8805258 0.7674460 1.0000000
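If you prefer an actual cosine distance rather than a similarity, one common convention is simply one minus the similarity matrix (the object name o1d is just illustrative):
o1d <- 1 - o1a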
We can do those same things for the long-format data:
dat2 <- tibble::tribble( ~Dates , ~City , ~Var1, ~Var2, ~Var3,
"2020-01-01" , "A" , 20 , 200 , 5 ,
"2020-01-02" , "A" , 30 , 300 , 3 ,
"2020-01-03" , "A" , 40 , 220 , 4 ,
"2020-01-04" , "A" , 20 , 150 , 2 ,
"2020-01-05" , "A" , 40 , 180 , 5 ,
"2020-01-01" , "B" , 50 , 200 , 6 ,
"2020-01-02" , "B" , 60 , 400 , 7 ,
"2020-01-03" , "B" , 80 , 600 , 8 ,
"2020-01-04" , "B" , 30 , 900 , 4 ,
"2020-01-05" , "B" , 50 , 180 , 2 ,
"2020-01-01" , "C" , 20 , 230 , 3 ,
"2020-01-02" , "C" , 30 , 340 , 5 ,
"2020-01-03" , "C" , 40 , 230 , 3 ,
"2020-01-04" , "C" , 20 , 120 , 5 ,
"2020-01-05" , "C" , 40 , 120 , 4 ,
"2020-01-01" , "D" , 20 , 400 , 5 ,
"2020-01-02" , "D" , 30 , 500 , 6 ,
"2020-01-03" , "D" , 10 , 600 , 7 ,
"2020-01-04" , "D" , 50 , 300 , 7 ,
"2020-01-05" , "D" , 20 , 300 , 4 )
library(tidyr)

# one row per city, one column per variable/date combination
d2w <- dat2 %>%
  pivot_wider(names_from = "Dates",
              values_from = c("Var1", "Var2", "Var3"))

X2 <- d2w %>% select(-City) %>% as.matrix()
rownames(X2) <- paste0("City ", d2w$City)
# sum of absolute differences over every variable/date column
edfun2 <- function(x, y){
  sum(sqrt((X2[x, ] - X2[y, ])^2))
}
o2 <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(edfun2))
rownames(o2) <- colnames(o2) <- rownames(X2)
o2
# City A City B City C City D
# City A 0 1364 179 1142
# City B 1364 0 1433 1208
# City C 179 1433 0 1149
# City D 1142 1208 1149 0
cdfun2 <- function(x, y){
  num <- sum(X2[x, ] * X2[y, ])
  d1 <- sqrt(sum(X2[x, ]^2))
  d2 <- sqrt(sum(X2[y, ]^2))
  num / (d1 * d2)
}
o2a <- outer(1:nrow(X2), 1:nrow(X2), Vectorize(cdfun2))
rownames(o2a) <- colnames(o2a) <- rownames(X2)
o2a
# City A City B City C City D
# City A 1.0000000 0.8051685 0.9861522 0.9742238
# City B 0.8051685 1.0000000 0.7617596 0.8338688
# City C 0.9861522 0.7617596 1.0000000 0.9637144
# City D 0.9742238 0.8338688 0.9637144 1.0000000
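To answer the single-variable part of the question: one way (a sketch; the object names X1, ef1 and o_var1 are just illustrative) is to keep only the columns of the variable of interest before building the matrix, and then reuse exactly the same machinery:
# distances based on Var1 only
X1 <- d2w %>% select(starts_with("Var1")) %>% as.matrix()
rownames(X1) <- paste0("City ", d2w$City)
ef1 <- function(x, y) sum(sqrt((X1[x, ] - X1[y, ])^2))
o_var1 <- outer(seq_len(nrow(X1)), seq_len(nrow(X1)), Vectorize(ef1))
rownames(o_var1) <- colnames(o_var1) <- rownames(X1)
o_var1
Swapping starts_with("Var1") for starts_with("Var2") or starts_with("Var3") gives the other single-variable matrices, and the analogous change works for the cosine function.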
Upvotes: 1