Reputation: 1475
I need to process a large data.table, setting a value in one column where there is a match between two other columns and another separate (small) Search Terms data.table. The value of the column is taken from the Search Terms where a match is found.
A match is found specifically where:
I have simplified my use case in the code below which uses the default storms data set (10k records). The main DT has 6.5 million records and the search term DT has less than 200.
My two questions are:
is there a way to make this more efficient (and reduce the estimated 90 minutes elapsed it will take to process 6.5 million record in the large data table?
how I can obtain a data.table result instead of the matrix which has been transposed (with columns as rows and rows as columns)?
Many thanks.
library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>%
mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())
# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01")
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to)
setDT(searchTerms)
# Function to determine if any search terms apply to the given row
# Typically only one is expected, although not guaranteed, so the first is taken
# A replacedStatus field is added containing either
# - the parameterised replacementStatus where a matching search term has been identified
# - the original status where no match has been identified
recodeValues <- function(row) {
date = row["date"]
status = row["status"]
recodeMatch <- head(
searchTerms[str_detect(status, searchTerms$searchStatus) &
(searchTerms$from == "" | date >= as.Date(searchTerms$from)) &
(searchTerms$to == "" | date <= as.Date(searchTerms$to))
,]
,1)
# I would use mult = "first" in the selection above but it seem to have no impact, so use head() instead
row["replacedStatus"] <- if_else(nrow(recodeMatch) > 0, recodeMatch[1]$replacementStatus, status)
return(row)
}
cat("Starting Recoding", "\n")
processorTime <- proc.time()
result <- apply(storms, 1, recodeValues)
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
View(result)
Upvotes: 0
Views: 69
Reputation: 8506
If I understand correctly what you want it might make more sense to iterate over the small "searchTerms" data.table
than over the large "storms" one.
You could then do something like this, which better leverages the power of data.table
:
library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>%
mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())
# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01")
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to)
setDT(searchTerms)
cat("Starting Recoding", "\n")
#> Starting Recoding
processorTime <- proc.time()
for(i in seq_len(dim(searchTerms)[1])){
x <- as.list(searchTerms[i])
if(x$from == "") {
storms[grep(x$searchStatus, status),
status:= x$replacementStatus]
} else {
storms[grep(x$searchStatus, status) &
between(date, as.Date(x$from), as.Date(x$to)),
status:= x$replacementStatus]
}
}
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
#> Recoding time (Elapsed): 0.034 seconds
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
#> Estimated Recoding time (Elapsed) for 6.5m records: 0.3787879 minutes
tail(storms[])
#> name year month day hour lat long status category wind
#> 1: Kate 2015 11 10 12 29.5 -75.4 tropical storm 0 60
#> 2: Kate 2015 11 10 18 31.2 -74.0 tropical storm 0 60
#> 3: Kate 2015 11 11 0 33.1 -71.3 hurricane any time 1 65
#> 4: Kate 2015 11 11 6 35.2 -67.6 hurricane any time 1 70
#> 5: Kate 2015 11 11 12 36.2 -62.5 hurricane any time 1 75
#> 6: Kate 2015 11 11 18 37.6 -58.2 hurricane any time 1 65
#> pressure ts_diameter hu_diameter date
#> 1: 998 103.5702 0.0000 2015-11-10
#> 2: 993 103.5702 0.0000 2015-11-10
#> 3: 990 161.1092 23.0156 2015-11-11
#> 4: 985 207.1404 23.0156 2015-11-11
#> 5: 980 345.2340 34.5234 2015-11-11
#> 6: 980 379.7574 46.0312 2015-11-11
Upvotes: 1