Reputation: 197
I'm trying to compute all the pairwise dissimilarities between observations in a data set consisting of only nominal variables, using a self-defined dissimilarity metric. The data looks like this:
library(data.table)
library(dplyr)
library(stringr)

set.seed(3424)
(mydata <- data.table(paste(sample(letters[1:5], 5, replace = TRUE),
                            sample(LETTERS[1:5], 5, replace = TRUE),
                            sep = ","),
                      paste(sample(LETTERS[1:5], 5, replace = TRUE),
                            sample(LETTERS[1:5], 5, replace = TRUE),
                            sep = ","),
                      paste(sample(letters[1:5], 5, replace = TRUE),
                            sample(letters[1:5], 5, replace = TRUE),
                            sep = ",")))
    V1  V2  V3
1: a,A E,E b,b
2: e,D C,A d,d
3: d,B B,C d,d
4: c,B A,E b,d
5: a,B C,D d,a
metric <- function(pair){
  intersection <- 0
  union <- 0  # counts columns entering the comparison (shadows base::union, but union(A, B) below still resolves to the function)
  for(i in 1:ncol(mydata)){
    A <- pair[[1]][[i]]
    B <- pair[[2]][[i]]
    if(sum(is.na(A), is.na(B)) == 1)   # exactly one side missing: counts toward the denominator only
      union <- union + 1
    if(sum(is.na(A), is.na(B)) == 0){  # both present: add the per-column Jaccard index
      intersection <- intersection + length(intersect(A, B))/length(union(A, B))
      union <- union + 1
    }
  }
  1 - intersection/union
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata) - 1)){
  print(i) ## to check progress ##
  for(j in (i + 1):nrow(mydata)){
    pair <- rbind(mydata[i], mydata[j])
    diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split = ",")) %>% metric()
  }
}
These loops work, but they are really slow when mydata has 1000+ rows and 100+ columns.
The metric I use here is the Jaccard index, but a nested version, since each element in the data is not a single value: instead of treating two levels as either a match (0) or different (1), I apply Jaccard when comparing the levels as well.
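For example, comparing the cells "a,A" and "a,B" within one column: after splitting on the comma, the two level sets share one of three distinct values, so that column contributes 1/3 rather than a hard 0 or 1:
A <- c("a", "A")
B <- c("a", "B")
length(intersect(A, B)) / length(union(A, B))  # 1/3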
Update:
Some context about my real data, not the toy data I made up: the number of elements in one cell can be quite arbitrary. Some cells contain very long lists of values, while many are actually NAs. E.g.
SELECT
1: NA
2: p1.PLAYERID,f1.PLAYERNAME,p2.PLAYERID,f2.PLAYERNAME
3: PLAYER f1,PLAYER f2,PLAYS p1
4: NA
5: NA
6: c1.table_name t1,c2.table_name t2
7: NA
...
400: asd,vrht,yuetr,wxeq,yiknuy,sce,ercher
Upvotes: 0
Views: 1127
Reputation: 4513
When constructing an algorithm it is important to keep in mind the speed/space trade-off. What I mean by that is that by storing your data in a different schema you can usually eliminate for loops; however, data stored in the new schema will generally occupy more space.
The reason your example is slow is that, among other things, you are looping over all the rows and the columns of your data. With a 1000x100 data.frame that is 1e5 computations. One way to eliminate the loop over your rows is to store your data a bit differently. For example, I use the expand.grid command to combine all pairwise comparisons within the same data.frame, dTMP. I then strip the comma and let each member of the pair occupy its own column (i.e. "a,A", which is originally contained in one variable, becomes "a" and "A", entries in two separate variables). In general, reshaping data into a different format is quick, or at least quicker than looping over each row. This reshaping, however, clearly generates a data set which takes up more RAM. In your case the data.frame will be 1e6x4, which is large, but not so large as to clog up all your RAM.
The reward for doing all that hard work is that it is now trivial and extremely fast to obtain the intersect and union variables. You will of course still need to loop over each column; however, we've eliminated one loop simply by rearranging your data. It would be possible to remove the column loop as well by using a 3D array, but such an array would not fit into memory.
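As a small illustration of the pairing step (with two made-up cells rather than a full column), expand.grid enumerates every ordered pair in one shot:
expand.grid(c("a,A", "e,D"), c("a,A", "e,D"), stringsAsFactors = FALSE)
#   Var1 Var2
# 1  a,A  a,A
# 2  e,D  a,A
# 3  a,A  e,D
# 4  e,D  e,D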
f3 <- function(){
  intersection <- 0
  for(v in names(mydata)){
    dTMP <- expand.grid(mydata[[v]], mydata[[v]], stringsAsFactors = FALSE)[, c(2, 1)]

    #There is likely a more elegant way to do this.
    #Note: this assumes every cell holds exactly two comma-separated values, as in the toy data.
    dTMP <-
      dTMP$Var2 %>%
      str_split(., ",") %>%
      unlist(.) %>%
      matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE) %>%
      cbind(., dTMP$Var1 %>%
                 str_split(., ",") %>%
                 unlist(.) %>%
                 matrix(., ncol = 2, nrow = nrow(dTMP), byrow = TRUE)) %>%
      as.data.frame(., stringsAsFactors = FALSE)
    names(dTMP) <- c("v1", "v2", "v3", "v4")

    intersect <- rowSums(dTMP[, c("v1", "v2")] == dTMP[, c("v3", "v4")])
    intersect <- ifelse(rowSums(dTMP[, c("v1", "v2")] == dTMP[, c("v4", "v3")]) != 0,
                        rowSums(dTMP[, c("v1", "v2")] == dTMP[, c("v4", "v3")]),
                        intersect)
    intersect <- ifelse(dTMP[, "v1"] == dTMP[, "v2"], 1, intersect)

    MYunion <- sapply(as.data.frame(t(dTMP)), function(x) n_distinct(x))
    intersection <- intersection + intersect/MYunion
  }
  union <- ncol(mydata)
  return(matrix(1 - intersection/union, nrow = nrow(mydata), ncol = nrow(mydata), byrow = TRUE)) #This is the diss matrix, I think. Double check that I got the rows and columns correct
}
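The function is then called directly, returning the full dissimilarity matrix in one shot:
diss3 <- f3()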
I'm still having trouble replicating your results; however, I believe the newly updated code is very close. There is only one cell, (2,1), of the dissimilarity matrix in which our results differ when set.seed(3424). The problem with the current iteration, however, is that I need a sapply to obtain MYunion. If you can think of a faster way to do this, you'll get big speed gains. Read this SO post for suggestions: Efficient Means of Identifying Number of Distinct Elements in a Row
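One possible replacement for that sapply, assuming dTMP always has exactly the four character columns v1..v4 and no NAs (an assumption that holds for the toy data, not necessarily the real data), is to count for each element whether it duplicates an earlier column, which vectorizes fully:
# distinct values per row = 4 minus the number of elements that equal an earlier column
dup2 <- dTMP$v2 == dTMP$v1
dup3 <- (dTMP$v3 == dTMP$v1) | (dTMP$v3 == dTMP$v2)
dup4 <- (dTMP$v4 == dTMP$v1) | (dTMP$v4 == dTMP$v2) | (dTMP$v4 == dTMP$v3)
MYunion <- 4 - (dup2 + dup3 + dup4)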
Upvotes: 1
Reputation: 2907
It's similar to the original, but I made a few changes. It runs more quickly; I didn't time it rigorously, but 1000 rows with this code seem to take about as long as 100 rows with the original.
The main changes:
- strsplit is applied once to each column up front, instead of being re-applied to every pair inside the inner loop.
- The metric works directly on the two pre-split rows, so there is no rbind/apply per pair.
- union(A, B) is only computed when the intersection is non-empty, since an empty intersection contributes nothing.
- The denominator is simply the number of columns, length(p1), which matches the original's union counter when there are no NAs.
Hopefully something helps your case.
rownum <- 1000
(mydata <- data.table(paste(sample(letters[1:5], rownum, replace = TRUE),
                            sample(LETTERS[1:5], rownum, replace = TRUE),
                            sep = ","),
                      paste(sample(LETTERS[1:5], rownum, replace = TRUE),
                            sample(LETTERS[1:5], rownum, replace = TRUE),
                            sep = ","),
                      paste(sample(letters[1:5], rownum, replace = TRUE),
                            sample(letters[1:5], rownum, replace = TRUE),
                            sep = ",")))
# split every cell once, up front, instead of inside the pairwise loop
allsplit <- lapply(mydata, strsplit, split = ",")
allsplitdf <- cbind(allsplit[["V1"]], allsplit[["V2"]], allsplit[["V3"]])
allsplitlist <- split(allsplitdf, 1:nrow(allsplitdf)) # one list element per row
metric2 <- function(p1, p2){
  intersection <- 0
  for(i in seq_along(p1)){
    A <- p1[[i]]
    B <- p2[[i]]
    if(!any(is.na(A), is.na(B))){
      lenint <- length(intersect(A, B))
      if(lenint > 0){ # only compute the union when there is something to add
        intersection <- intersection + lenint/length(union(A, B))
      }
    }
  }
  1 - intersection/length(p1)
}
diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
for(i in 1:(nrow(mydata) - 1)){
  print(i) ## to check progress ##
  for(j in (i + 1):nrow(mydata)){
    diss[j, i] <- mapply(metric2, p1 = allsplitlist[i], p2 = allsplitlist[j])
  }
}
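Since each element of allsplitlist holds the pre-split vectors for one row, a single dissimilarity can also be computed directly, which is handy for spot-checking:
metric2(allsplitlist[[1]], allsplitlist[[2]]) # dissimilarity between rows 1 and 2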
Upvotes: 1
Reputation: 2489
You can gain some speed pretty easily by doing less work. If you are only interested in pairwise comparisons, you only need to do N choose 2 comparisons instead of N^2. You can implement that with f2() below.
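To see the saving, combn enumerates only the unordered pairs; for 4 rows that is 6 comparisons instead of 16:
combn(4, 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4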
set.seed(3424)
(mydata <- data.table(sample(letters[1:5], 50, replace = TRUE),
                      sample(LETTERS[1:5], 50, replace = TRUE),
                      sample(1:3, 50, replace = TRUE)))
mydf <- data.frame(mydata)
f1 <- function(){
  diss <- matrix(nrow = nrow(mydata), ncol = nrow(mydata))
  for(i in 1:(nrow(mydata) - 1)){
    print(i) ## to check progress ##
    for(j in (i + 1):nrow(mydata)){
      pair <- rbind(mydata[i], mydata[j])
      diss[j, i] <- apply(pair, 1, function(x) strsplit(x, split = ",")) %>% metric()
    }
  }
  return(diss)
}
f2 <- function(){
  met <- NULL
  A <- NULL
  B <- NULL
  choices <- choose(nrow(mydf), 2)
  combs <- combn(nrow(mydf), 2) # one column per unordered pair of row indices
  for(i in 1:choices) {
    print(i)
    pair <- rbind(mydf[combs[1, i], ], mydf[combs[2, i], ])
    met[i] <- apply(pair, 1, function(x) strsplit(x, split = ",")) %>% metric()
    A[i] <- mydf[combs[1, i], 1]
    B[i] <- mydf[combs[2, i], 2]
  }
  results <- data.frame(A, B, met)
  return(results)
}
library(microbenchmark)
microbenchmark(f1(), f2(), times = 10)
Unit: milliseconds
 expr  min     lq   mean median     uq  max neval
 f1() 1381 1391.2 1416.8 1417.6 1434.9 1456    10
 f2()  907  923.6  942.3  946.9  948.9 1008    10
It is a little faster, but not mind-blowingly so. My guess is that some more work needs to be done on the metric function you define. I tried to find a way to vectorize it, but could not. If that can be done, this problem would be trivial. For example, I have a similar program that measures pairwise cosine similarity between ~400 vectors of length ~5000. It has to make 400 choose 2 = 79800 comparisons and the entire program takes about 6 seconds to run.
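For reference, that cosine-similarity case is fast precisely because it reduces to a single matrix product; a minimal sketch with made-up data of the same shape:
set.seed(1)
X  <- matrix(rnorm(400 * 5000), nrow = 400) # 400 vectors of length 5000
Xn <- X / sqrt(rowSums(X^2))                # normalize each row to unit length
cosSim <- tcrossprod(Xn)                    # 400 x 400 cosine-similarity matrix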
Upvotes: 1