Forrest
Forrest

Reputation: 76

Vectorized Rolling/Cumulative Mahalanobis Distance in R

I'm trying to calculate a rolling mahalanobis distance without resorting to for loops and failing miserably.

Here is an example dataset:

df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
               date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
               valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
               valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
               valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))

I'm trying to calculate, by group (label), the mahalanobis distance of valx, valy, and valz, but only using rows from that date (date) or previous. My current solution is to loop through each label, loop through each date, filter the dataset down to the matching data, calculate distance using stats::mahalanobis, add that distance to a list, and then do.call and rbind them outside the loop*. Clearly this isn't ideal.

I suspect that there's some way to write:

cum.mdist <- function(df, cols) {...}
df %>%
  group_by(label) %>%
  arrange(date) %>%
  mutate(mdist = xapply(., c(valx, valy, valz), cum.mdist)) %>%
  ungroup()

in a similar way to calculating a rolling unary function like so:

cumsd <- function(x) sapply(seq_along(x), function(k, z) sd(z[1:k]), z = x)

I could calculate the distance from component parts if there were no covariance (rolling variance variance is simple to calculate using a function like the above one), but I think my variables do have covariance, and I'm not sure how to build a rolling covariance matrix...

Does a solution to this exist outside of for loops?


*code for a looped solution is below:

library("tidyverse")
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
                 valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
mdist.list <- vector(length = nrow(df), mode = "list")
counter <- 1

for(l in seq_along(unique(df$label))){
  label_data <- df %>%
    filter(label == unique(df$label)[l])

  for(d in seq_along(unique(label_data$date))){
    label_date_data <- label_data %>%
      filter(date <= unique(label_data$date)[d])

    if(nrow(label_date_data) > 3){
      label_date_data$mdist <- mahalanobis(label_date_data %>% select(contains("val")),
                                           colMeans(label_date_data %>% select(contains("val"))),
                                           cov(label_date_data %>% select(contains("val"))))
    } else{
      label_date_data$mdist <- NA
    }

    mdist.list[[counter]] <- filter(label_date_data, 
                                    date == unique(label_data$date)[d])

    counter <- counter + 1
  }
}

mdist.df <- do.call(rbind, mdist.list)

Upvotes: 3

Views: 276

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

Not sure if I understand your requirements or desired output correctly, here is something using data.table to get you started:

library(data.table)
setDT(df)
df[, mdist := 
    .SD[, transpose(lapply(1L:.N, function(n) {
        ma <- .SD[1L:n]
        ans <- tryCatch(mahalanobis(ma, colMeans(ma), var(ma)), error=function(e) NA)
        ans[length(ans)]            
    })), by=.(label), .SDcols=valx:valz]$V1]

output:

    label       date         valx       valy        valz     mdist
 1:     A 2018-01-01  1.262954285   7.635935  -2.2426789        NA
 2:     A 2018-01-02 -0.326233361  -7.990092   3.7739565        NA
 3:     A 2018-01-03  1.329799263 -11.476570   1.3333636        NA
 4:     A 2018-01-04  1.272429321  -2.894616   8.0418951 2.2500000
 5:     A 2018-01-05  0.414641434  -2.992151  -0.5710677 0.7260652
 6:     B 2018-01-01 -1.539950042  -4.115108  15.1082392        NA
 7:     B 2018-01-02 -0.928567035   2.522234  32.5730809        NA
 8:     B 2018-01-03 -0.294720447  -8.919211 -20.7286152        NA
 9:     B 2018-01-04 -0.005767173   4.356833 -38.5379806 2.2500000
10:     B 2018-01-05  2.404653389 -12.375384   1.4017852 3.0800360

data:

set.seed(0L)
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
    date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
    valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 0, sd = 1)),
    valy = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 10)),
    valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))

I will delete this post if you are looking only for a tidyverse solution.

Upvotes: 1

Related Questions