Chris

Reputation: 1475

Process large DataTable using data from another in R

I need to process a large data.table, setting a value in one column where there is a match between two other columns and another separate (small) Search Terms data.table. The value of the column is taken from the Search Terms where a match is found.

A match is found specifically where:

  - the status column contains the search term's searchStatus string, and
  - the date column falls between the search term's from and to dates (an empty from or to means no date restriction).

I have simplified my use case in the code below which uses the default storms data set (10k records). The main DT has 6.5 million records and the search term DT has less than 200.

My two questions are:

  1. Is there a way to make this more efficient (and reduce the estimated 90 minutes elapsed time it will take to process 6.5 million records in the large data.table)?

  2. How can I obtain a data.table result instead of the transposed character matrix (with columns as rows and rows as columns)?

Many thanks.

library(tidyverse)
library(lubridate)
library(data.table)

data("storms")
setDT(storms)
storms <- storms %>% 
  mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

# Function to determine if any search terms apply to the given row
# Typically only one is expected, although not guaranteed, so the first is taken
# A replacedStatus field is added containing either
# - the parameterised replacementStatus where a matching search term has been identified
# - the original status where no match has been identified 
recodeValues <- function(row) { 
  date = row["date"]
  status = row["status"]
  recodeMatch <- head(
    searchTerms[str_detect(status, searchTerms$searchStatus) &
                  (searchTerms$from == "" | date >= as.Date(searchTerms$from)) &
                  (searchTerms$to == ""   | date <= as.Date(searchTerms$to))
                  ,]
      ,1)
  # mult = "first" would be preferable in the selection above, but it seems to have no effect here, so head() is used instead
  row["replacedStatus"]  <- if_else(nrow(recodeMatch) > 0, recodeMatch[1]$replacementStatus, status)
  return(row)
}

cat("Starting Recoding", "\n")
processorTime <- proc.time()
result <- apply(storms, 1, recodeValues)
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
View(result)
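For reference on question 2: `apply()` first coerces the data.table to a matrix (everything becomes character because of the mixed column types), then binds the per-row results together as *columns*, which is why the result comes back transposed. A minimal sketch with a hypothetical two-column table shows the effect and one way to recover a data.table:

```r
library(data.table)

dt <- data.table(a = 1:2, b = c("x", "y"))

# apply() coerces dt to a character matrix, then each row's result
# becomes a *column* of the output
m <- apply(dt, 1, function(row) row)

# transpose to restore row layout, then convert back to a data.table;
# note the columns are now all character, not their original types
res <- as.data.table(t(m))
```

This only restores the shape; the original column types would still need to be reconverted, which is one more reason to avoid row-wise `apply()` on a data.table.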

Upvotes: 0

Views: 69

Answers (1)

user12728748

Reputation: 8506

If I understand correctly what you want, it might make more sense to iterate over the small "searchTerms" data.table than over the large "storms" one.

You could then do something like this, which better leverages the power of data.table:

library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>% 
    mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

cat("Starting Recoding", "\n")
#> Starting Recoding
processorTime <- proc.time()
for(i in seq_len(dim(searchTerms)[1])){
    x <- as.list(searchTerms[i])
    if(x$from == "") {
        storms[grep(x$searchStatus, status),
               status:= x$replacementStatus]  
    } else {
        storms[grepl(x$searchStatus, status) & # grepl (logical vector), so it combines correctly with between()
                   between(date, as.Date(x$from), as.Date(x$to)),
               status:= x$replacementStatus]
    }
}

cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
#> Recoding time (Elapsed): 0.034  seconds
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
#> Estimated Recoding time (Elapsed) for 6.5m records: 0.3787879  minutes
tail(storms[])
#>    name year month day hour  lat  long             status category wind
#> 1: Kate 2015    11  10   12 29.5 -75.4     tropical storm        0   60
#> 2: Kate 2015    11  10   18 31.2 -74.0     tropical storm        0   60
#> 3: Kate 2015    11  11    0 33.1 -71.3 hurricane any time        1   65
#> 4: Kate 2015    11  11    6 35.2 -67.6 hurricane any time        1   70
#> 5: Kate 2015    11  11   12 36.2 -62.5 hurricane any time        1   75
#> 6: Kate 2015    11  11   18 37.6 -58.2 hurricane any time        1   65
#>    pressure ts_diameter hu_diameter       date
#> 1:      998    103.5702      0.0000 2015-11-10
#> 2:      993    103.5702      0.0000 2015-11-10
#> 3:      990    161.1092     23.0156 2015-11-11
#> 4:      985    207.1404     23.0156 2015-11-11
#> 5:      980    345.2340     34.5234 2015-11-11
#> 6:      980    379.7574     46.0312 2015-11-11
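One subtlety: the loop above overwrites status in place, so later search terms are matched against already-replaced values, and the last matching term wins rather than the first as in the question. A minimal variant (sketched here on hypothetical mini versions of the two tables) keeps the original status, writes matches into a separate replacedStatus column, and guards the update so that only the first matching term applies:

```r
library(data.table)

# Hypothetical mini data with the same structure as the question's tables
storms2 <- data.table(
  status = c("hurricane", "tropical storm", "tropical depression"),
  date   = as.Date(c("1980-06-01", "1977-07-01", "1990-01-01"))
)
searchTerms2 <- data.table(
  searchStatus      = c("hurricane", "orm", "sion"),
  replacementStatus = c("hurricane any time", "76 storm", "81 depression"),
  from = c("", "1976-01-01", "1976-01-01"),
  to   = c("", "1981-01-01", "1981-12-31")
)

# Default: keep the original status where no term matches
storms2[, replacedStatus := status]

for (i in seq_len(nrow(searchTerms2))) {
  x <- as.list(searchTerms2[i])
  hit <- grepl(x$searchStatus, storms2$status) &
    (x$from == "" | (storms2$date >= as.Date(x$from) &
                     storms2$date <= as.Date(x$to)))
  # Only update rows not yet replaced, so the first matching term wins
  storms2[hit & replacedStatus == status, replacedStatus := x$replacementStatus]
}
```

Here the third row is not recoded because its date falls outside the "sion" term's range, matching the question's original semantics.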

Upvotes: 1
