James Eckhardt

Reputation: 11

R optimize loop with Tidyverse

I am working with a large data set and have written working code using for loops. Because of the data's size I want to make the code more efficient, and I think there must be a way using the tidyverse. Briefly, I have a data frame with object IDs (seed IDs) and their x and y coordinates.

object <- c('A','B','C')
x <- c(147, 146, 143)
y <- c(17, 80, 155)
df_Seeds <- data.frame(object, x, y)
df_Seeds$object <- as.character(df_Seeds$object)

I also have an array with another set of objects (radicles) and their x and y coordinates.

x1 <- c(180, 146, 143, 17, 17, 155, 30, 30, 30)
array_Radicles <- array(x1,dim = c(3,3))

The following code outputs an array with the index of any radicle objects within a certain distance of each seed and another array with the seed object ID. Lastly, I cbind the arrays.

seedID_Array <- array(dim = c(0, 1)) # blank array for seed IDs
radicleIndex_Array <- array(dim = c(0, 1)) # blank array for radicle indexes
for (i in 1:dim(df_Seeds)[1]) { # loops through each seed object
  indexRadicles <- which(abs(array_Radicles[, 1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[, 2] - df_Seeds[i, 3]) <= 25) # vector of indexes of any radicles within distance of this seed
  if (length(indexRadicles) > 0) { # some seed objects will not have an associated radicle
    for (j in 1:length(indexRadicles)) { # loops through each radicle index
      singleIndexRadicles <- indexRadicles[j]
      seedID_Array <- rbind(seedID_Array, df_Seeds[i, 1]) # adds seed object ID to array
      radicleIndex_Array <- rbind(radicleIndex_Array, singleIndexRadicles) # adds radicle index to array
    }
  }
}
combinedArray <- cbind(seedID_Array, radicleIndex_Array)

I appreciate any suggestions or direction to another similar problem that has been solved.

Upvotes: 1

Views: 94

Answers (2)

Jon Spring

Reputation: 66570

Your question is an example of a non-equi join, where the distance requirement constrains the matches between two tables. dplyr only gained native non-equi joins in version 1.1.0 (via join_by()), but in many cases my data is small enough (e.g. the cartesian product table still fits in memory) for a brute-force method to work fine and be fast enough.
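
For completeness, here is a minimal sketch of the native approach, assuming dplyr >= 1.1.0 and the df_Radicles tibble constructed in Option 1 below; the x_lo/x_hi/y_lo/y_hi bound columns are hypothetical helpers I add for the join:

library(dplyr) # requires dplyr >= 1.1.0 for join_by()

# Precompute the rectangular bounds on the seed side, then keep any
# radicle whose coordinates fall inside a seed's box.
df_Seeds %>%
  mutate(x_lo = x - 50, x_hi = x + 50,
         y_lo = y - 25, y_hi = y + 25) %>%
  inner_join(df_Radicles,
             by = join_by(between(y$x, x$x_lo, x$x_hi),
                          between(y$y, x$y_lo, x$y_hi))) %>%
  select(object, rad_idx)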

Option 1: cartesian product then filter

Here I join every radicle to every seed ID (this could be untenable if your data is big enough) and then filter out the pairs I don't need.

library(tidyverse)
df_Radicles <- tibble(x = array_Radicles[, 1],
                      y = array_Radicles[, 2],
                      misc = array_Radicles[, 3],
                      rad_idx = seq_along(array_Radicles[, 1]))

# brute force non-equi join: join all then filter
crossing(object = df_Seeds$object, df_Radicles) %>%
  left_join(df_Seeds, by = "object") %>%
  filter(abs(x.x - x.y) <= 50, abs(y.x - y.y) <= 25) %>%
  select(object, rad_idx)


# A tibble: 3 x 2
  object rad_idx
  <chr>    <int>
1 A            2
2 A            1
3 C            3

Option 2: fuzzyjoin

The fuzzyjoin package allows non-equi joins and has built-in methods for distance joins. Here I use a Manhattan distance metric, but since your y tolerance (25) is half your x tolerance (50), I scale y by 2 so both coordinates can be evaluated on the same +/- 50 scale. Note that the Manhattan join enforces |x diff| + 2*|y diff| <= 50, a slightly stricter (diamond-shaped) condition than your rectangular |x diff| <= 50 and |y diff| <= 25 filter, though it returns the same matches on this data. There's also a geo_join option if you're dealing with lat/lon coordinates.

library(fuzzyjoin)
df_Seeds %>%
  mutate(y = y * 2) %>%  # put y on the same +/- 50 scale as x for manhattan distance
  distance_inner_join(
    df_Radicles %>% mutate(y = y * 2),
    by = c("x", "y"),
    method = "manhattan",
    max_dist = 50) %>%
  select(object, rad_idx)


  object rad_idx
1      A       1
2      A       2
3      C       3

If these approaches aren't performant on your data, I'd recommend using data.table, which is phenomenally fast for this sort of thing and supports non-equi joins natively.
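
As a rough sketch of what that could look like (assuming the same columns as the df_Radicles tibble above; seeds, rads, and the bound columns are hypothetical names):

library(data.table)

seeds <- as.data.table(df_Seeds)
rads <- data.table(x = array_Radicles[, 1],
                   y = array_Radicles[, 2],
                   rad_idx = seq_len(nrow(array_Radicles)))

# add the rectangular bounds to the seed table
seeds[, `:=`(x_lo = x - 50, x_hi = x + 50, y_lo = y - 25, y_hi = y + 25)]

# non-equi join: keep radicles whose coordinates fall inside a seed's box
rads[seeds,
     .(object = i.object, rad_idx = x.rad_idx),
     on = .(x >= x_lo, x <= x_hi, y >= y_lo, y <= y_hi),
     nomatch = NULL]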

Upvotes: 1

eduardokapp

Reputation: 1751

Well, first of all, I highly recommend you take a look at this resource and this other resource. They're both good references on how to avoid some simple mistakes and get your R code going faster.

For your specific problem, I'd say the biggest performance bottleneck is the use of rbind and cbind inside loops. Each call allocates a full copy of the original object and then fills it in with the second argument, so growing an object this way is very inefficient (the total work is quadratic in the number of rows added).

Also, the inner loop is unnecessary: you are adding the elements of indexRadicles one at a time while repeatedly rbinding the same df_Seeds[i, 1], when all of them could be appended in a single step.

To solve this, a possible solution would be to use a list, indexing by Seed ID. For example:

output <- list()
for (i in 1:dim(df_Seeds)[1]) { # loops through each seed object
  indexRadicles <- which(abs(array_Radicles[, 1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[, 2] - df_Seeds[i, 3]) <= 25) # vector of indexes of any radicles within distance of this seed
  if (length(indexRadicles) > 0) { # some seed objects will not have an associated radicle
    output[[df_Seeds$object[i]]] <- indexRadicles
  }
}
seeds_that_had_index_radicles <- names(output)
all_index_radicles <- unlist(output)
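
If you need something shaped like the original combinedArray, the list can be flattened into a two-column data frame in one vectorized step (a sketch; combined is a hypothetical name):

# one row per (seed ID, radicle index) pair, equivalent to combinedArray
combined <- data.frame(object = rep(names(output), lengths(output)),
                       rad_idx = unlist(output, use.names = FALSE))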

Note that we did not use any tidyverse functions here. I think it is a mistake to assume that tidyverse solutions are always faster or more efficient. They can help you express some operations more clearly, or at least visualize them better, but you can usually do the same things with the same performance using base R.

Bonus: On a side note, you can always use profvis to help you find the performance bottlenecks in your code. It will show you which lines take the longest and which are called the most. Highly recommend taking a look at it: https://rstudio.github.io/profvis/
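
For example, a minimal sketch of profiling the loop above (using the df_Seeds and array_Radicles objects from the question):

library(profvis)

profvis({
  output <- list()
  for (i in 1:dim(df_Seeds)[1]) {
    indexRadicles <- which(abs(array_Radicles[, 1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[, 2] - df_Seeds[i, 3]) <= 25)
    if (length(indexRadicles) > 0) {
      output[[df_Seeds$object[i]]] <- indexRadicles
    }
  }
})
# profvis is a sampling profiler: on a tiny input like this toy example it may
# record nothing, so profile on realistically sized data.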

Upvotes: 1
