How to speed up row level operation with dplyr

Question

I have a large dataset that I try to manipulate using dplyr. My data wrangling task requires row level string manipulation.

I am using the default rowwise() function, and the code is working. However, the operation is taking a lot of time to complete.

VR_vehicle_GPSLocation = c("12.36556|0.74518153|xxxxxxxxxx", 
      "-51.75810|165.55526|xxxxxxxxxx", 
      "GPS nicht verfügbar",
      "48.77410|171.08364|xxxxxxxxxx", 
      "GPS Not Available",
      "0|0|N/R",
      "32.18661| 170.56615|xxxxxxxxxx")
df = data.frame(VR_vehicle_GPSLocation)

jobs_location <- df %>%
     rowwise() %>% 
     mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[1]),
            longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[2])) %>%
     select(latitude, longitude)

In order to speed up the process, I explored the multidyplyr library without success, I am getting an error message saying that my dataset is not a data frame.

jobs_location <- jobs %>%
  partition() %>%
  rowwise() %>% 
  mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[2])) %>%
  collect()

Michael · Accepted Answer

All credit to @DavidArenburg

I approached the problem from a non-efficient angle. Using a vectorized approach clearly improve the performance.

For the sake of completeness, I run the code on a random tiny subset of the whole dataset to evaluate the performance and clearly vectorization is the way to go for my problem.

Finally it is important to mention that a pre-cleaning task is needed to ensure the resulting transformation is numeric (refer to David's comment for more details)

library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                   c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]
)

autoplot(mbm)

A picture is worth a thousand words

Another suggestion from David is to convert the data to numeric after tstrsplit. I added two functions one that is doing the conversion on the overall columns and one that apply the type conversion after the split.

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)],

  c = sapply(setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2)], as.numeric),

  d = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                        c("latitude", "longitude") := lapply(tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2), as.numeric)]
)
autoplot(mbm)

The last variant (d) is clearly the winner.

How to speed up row level operation with dplyr

Answers (1)

Related Questions