Speed up complex loop and group by in R for large data set

Question

I have a below mentioned code:

    library(dplyr)

# Create sample data frame
df <- data.frame(
  ID = 1:6,
  SR1 = c(123,124,125,125,785,849),
  SR2 = c("as#12.c", "ae&14.v", "at$19.e", "at$19.d", "ab&22.n", "ab&22.n"),
  DRC1 = c("ABC-1", "ABC-1", "AXX-1", "AXX-1", "AWZ-2", "AWZ-5"),
  DX2 = c("SXI", "SXI", NA, "SCV", "DDF", "DDF"),
  stringsAsFactors = FALSE
)

# Create a function to give Status with each kind of DRC1 according to your rules
StatusJudge <- function(df_sub) {
  if (dim(df_sub)[1] == 1) {
    df_sub$Status <- FALSE
  }
  else {
    if (all(!is.na(df_sub$DX2))) {
      df_sub$Status <-
        ifelse(length(unique(df_sub$DX2)) == 1, TRUE, FALSE)
    }
    else {
      df_sub$Status <-
        ifelse(length(unique(df_sub$SR1)) == 1 | length(unique(df_sub$SR2)) == 1, TRUE, FALSE)
      if (any(!is.na(df_sub$DX2))) {
        df_sub$IDfound[is.na(df_sub$DX2)] <-
          df_sub$ID[!is.na(df_sub$DX2)][1]
      }
    }
  }
  return(df_sub)
}

# Apply the StatusJudge to each element of df_list and then combine the results
df <- df %>%
  mutate(Status = NA, IDfound = NA) %>%
  group_by(DRC1) %>%
  do(StatusJudge(.)) %>%
  arrange(ID)

Which gives below mentioned output:

ID    SR1     SR2  DRC1   DX2  Status IDfound
           
  1    123   as#12.c ABC-1  SXI   TRUE      NA
  2    124   ae&14.v ABC-1  SXI   TRUE      NA
  3    125   at$19.e AXX-1    TRUE       4
  4    125   at$19.d AXX-1  SCV   TRUE      NA
  5    785   ab&22.n AWZ-2  DDF  FALSE      NA
  6    849   ab&22.n AWZ-5  DDF  FALSE      NA

The problem here is i have a big dataset (~1 million Rows), where it's taking too much time even after waiting almost 4 hours i didn't get output. but the same code is working fine for small data set (~10K Rows etc.).

Please help to speedup this code.

Uwe · Accepted Answer

Please, test this data.table approach with your production data. I have tried to convert the nested if ... else and ifelse() statements into a boolean expression.

This seems to work as expected for the small sample dataset but needs thorough testing with more test cases.

library(data.table)

# use boolean expressions instead of if ... else clauses to create Status
setDT(df)[, Status := .N != 1L && 
            (all(!is.na(DX2)) && uniqueN(DX2 == 1L) ||
               any(is.na(DX2)) && (uniqueN(SR1) == 1L || uniqueN(SR2) == 1L)), by = DRC1][]

# append IDfound column
# create lookup table
mDT <- df[!is.na(DX2), .(DX2 = NA_character_, first(ID)), by = DRC1][]
# join with lookup table and update during join
df[mDT, on = .(DX2, DRC1), IDfound := V2][]

   ID SR1     SR2  DRC1 DX2 Status IDfound
1:  1 123 as#12.c ABC-1 SXI   TRUE      NA
2:  2 124 ae&14.v ABC-1 SXI   TRUE      NA
3:  3 125 at$19.e AXX-1  NA   TRUE       4
4:  4 125 at$19.d AXX-1 SCV   TRUE      NA
5:  5 785 ab&22.n AWZ-2 DDF  FALSE      NA
6:  6 849 ab&22.n AWZ-5 DDF  FALSE      NA

The lookup table mDT is used to find matches in the DX2 and DRC1 columns. V2 contains the ID of the first row in each DRC1 group where DX2 is not NA.

mDT

    DRC1 DX2 V2
1: ABC-1  NA  1
2: AXX-1  NA  4
3: AWZ-2  NA  5
4: AWZ-5  NA  6

Only those entries of df are updated where both DX2 and DRC1 match. By joining, it looks for rows where DRC1 matches and DX2 is NA. If one is found, the corresponding V2 values is copied to column IDfound.

Speed up complex loop and group by in R for large data set

Answers (1)

Related Questions