Reputation: 11
I am working with a large data set, and have written working code using for loops. I want to optimize the efficiency of running the code due to the large data set and think there must be a way using Tidyverse. Briefly, I have a data frame with object IDs (seed IDs) and their x and y coordinates.
object <- c('A','B','C')
x <- c(147, 146, 143)
y <- c(17, 80, 155)
df_Seeds <- data.frame(object, x, y)
df_Seeds$object <- as.character(df_Seeds$object)
I also have an array with another set of objects (radicles) and their x and y coordinates.
x1 <- c(180, 146, 143, 17, 17, 155, 30, 30, 30)
array_Radicles <- array(x1,dim = c(3,3))
The following code outputs an array with the index of any radicle objects within a certain distance of each seed and another array with the seed object ID. Lastly, I cbind the arrays.
seedID_Array <-array(dim=(0:1)) #blank array for seedID
radicleIndex_Array <-array(dim=(0:1)) #blank array for radicle index
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
for (j in 1:length(indexRadicles)) { #loops through each radicle index
singleIndexRadicles <- indexRadicles[j]
seedID_Array <- rbind(seedID_Array, df_Seeds[i,1]) #adds seed object ID to array
radicleIndex_Array <- rbind(radicleIndex_Array, singleIndexRadicles) #adds radicle index to array
}
}
}
combinedArray <- cbind(seedID_Array, radicleIndex_Array)
I appreciate any suggestions or direction to another similar problem that has been solved.
Upvotes: 1
Views: 94
Reputation: 66570
Your question is an example of a non-equi join, where your distance requirement constrains the matches between two tables. dplyr
does not currently allow non-equi joins, but in many cases my data is small enough (eg cartesian product table still fits in memory) for a brute force method to work fine and be fast enough.
Option 1: cartesian product then filter
Here I join every radicle to every seedID (this could be untenable if your data is big enough) and then filter out the ones I don't need.
library(tidyverse)
df_Radicles <- tibble(x = array_Radicles[,1],
y = array_Radicles[,2],
misc = array_Radicles[,3],
rad_idx = 1:length(array_Radicles[,1]))
# brute force non-equi join: join all then filter
crossing(object = df_Seeds$object, df_Radicles) %>%
left_join(df_Seeds, by = "object") %>%
filter(abs(x.x - x.y) <= 50, abs(y.x - y.y) <= 25) %>%
select(object, rad_idx)
# A tibble: 3 x 2
object rad_idx
<chr> <int>
1 A 2
2 A 1
3 C 3
Option 2: fuzzyjoin
The fuzzyjoin
package allows non-equi joins, and has built in methods for distance joins. In this case you're using a manhattan distance metric, but since your y distance is different I scale it here * 2
so that it can be evaluated on the same +/- 50 scale as your x distance. There's also a geo_join
option if you're dealing with lat/lon coordinates.
library(fuzzyjoin)
df_Seeds %>%
mutate(y = y * 2) %>% # to use manhattan distance with x + y on same scale
distance_inner_join(
df_Radicles %>% mutate(y = y*2),
by = c("x", "y"),
method = "manhattan",
max_dist = 50) %>%
select(object, rad_idx)
object rad_idx
1 A 1
2 A 2
3 C 3
If these approaches aren't performant on your data, I'd recommend using data.table
, which is phenomenally fast for this sort of thing.
Upvotes: 1
Reputation: 1751
Well, first of all, I highly recommend you take a look at this resource and this other resource. They're both good references on how to avoid some simple mistakes and get your R code going faster.
For your specific problem, I'd say the biggest performance bottleneck is when you're using rbind
and cbind
. Those functions create a copy of your original object and then fill in with the second argument. This is not very efficient.
Also, in your inner loop, you're essentially adding all the indexes in indexRadicles
and repeteadly rbind
ing df_Seeds[i,1]
.
To solve this, a possible solution would be to use a list, indexing by Seed ID. For example:
output <- list()
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
output[df_Seeds$object[i]] <- list(indexRadicles)
}
}
seeds_that_had_index_radicles <- names(output)
all_index_radicles <- unlist(output)
Note that we did not use any tidyverse
solution here. I believe it is wrong to even assume tidyverse
solutions to be always faster or more efficient. I personally think that they help you understand some operations better, or at least visualize them better. But you can usually do the same things with the same performance using base R.
Bonus: On a side note, you can always use profvis
to help you find out the performance bottlenecks in your code. It will you show what lines are taking longer, or what lines are being called the most. Highly recommend taking a look at it: https://rstudio.github.io/profvis/
Upvotes: 1