Reputation: 179
I have a below mentioned code:
library(dplyr)
# Create sample data frame
df <- data.frame(
ID = 1:6,
SR1 = c(123,124,125,125,785,849),
SR2 = c("as#12.c", "ae&14.v", "at$19.e", "at$19.d", "ab&22.n", "ab&22.n"),
DRC1 = c("ABC-1", "ABC-1", "AXX-1", "AXX-1", "AWZ-2", "AWZ-5"),
DX2 = c("SXI", "SXI", NA, "SCV", "DDF", "DDF"),
stringsAsFactors = FALSE
)
# Create a function to give Status with each kind of DRC1 according to your rules
StatusJudge <- function(df_sub) {
if (dim(df_sub)[1] == 1) {
df_sub$Status <- FALSE
}
else {
if (all(!is.na(df_sub$DX2))) {
df_sub$Status <-
ifelse(length(unique(df_sub$DX2)) == 1, TRUE, FALSE)
}
else {
df_sub$Status <-
ifelse(length(unique(df_sub$SR1)) == 1 | length(unique(df_sub$SR2)) == 1, TRUE, FALSE)
if (any(!is.na(df_sub$DX2))) {
df_sub$IDfound[is.na(df_sub$DX2)] <-
df_sub$ID[!is.na(df_sub$DX2)][1]
}
}
}
return(df_sub)
}
# Apply the StatusJudge to each element of df_list and then combine the results
df <- df %>%
mutate(Status = NA, IDfound = NA) %>%
group_by(DRC1) %>%
do(StatusJudge(.)) %>%
arrange(ID)
Which gives below mentioned output:
ID SR1 SR2 DRC1 DX2 Status IDfound
<int> <dbl> <chr> <chr> <chr> <lgl> <int>
1 123 as#12.c ABC-1 SXI TRUE NA
2 124 ae&14.v ABC-1 SXI TRUE NA
3 125 at$19.e AXX-1 <NA> TRUE 4
4 125 at$19.d AXX-1 SCV TRUE NA
5 785 ab&22.n AWZ-2 DDF FALSE NA
6 849 ab&22.n AWZ-5 DDF FALSE NA
The problem here is i have a big dataset (~1 million Rows), where it's taking too much time even after waiting almost 4 hours i didn't get output. but the same code is working fine for small data set (~10K Rows etc.).
Please help to speedup this code.
Upvotes: 0
Views: 113
Reputation: 42564
Please, test this data.table
approach with your production data. I have tried to convert the nested if ... else
and ifelse()
statements into a boolean expression.
This seems to work as expected for the small sample dataset but needs thorough testing with more test cases.
library(data.table)
# use boolean expressions instead of if ... else clauses to create Status
setDT(df)[, Status := .N != 1L &&
(all(!is.na(DX2)) && uniqueN(DX2 == 1L) ||
any(is.na(DX2)) && (uniqueN(SR1) == 1L || uniqueN(SR2) == 1L)), by = DRC1][]
# append IDfound column
# create lookup table
mDT <- df[!is.na(DX2), .(DX2 = NA_character_, first(ID)), by = DRC1][]
# join with lookup table and update during join
df[mDT, on = .(DX2, DRC1), IDfound := V2][]
ID SR1 SR2 DRC1 DX2 Status IDfound 1: 1 123 as#12.c ABC-1 SXI TRUE NA 2: 2 124 ae&14.v ABC-1 SXI TRUE NA 3: 3 125 at$19.e AXX-1 NA TRUE 4 4: 4 125 at$19.d AXX-1 SCV TRUE NA 5: 5 785 ab&22.n AWZ-2 DDF FALSE NA 6: 6 849 ab&22.n AWZ-5 DDF FALSE NA
The lookup table mDT
is used to find matches in the DX2
and DRC1
columns. V2
contains the ID
of the first row in each DRC1
group where DX2
is not NA
.
mDT
DRC1 DX2 V2 1: ABC-1 NA 1 2: AXX-1 NA 4 3: AWZ-2 NA 5 4: AWZ-5 NA 6
Only those entries of df
are updated where both DX2
and DRC1
match. By joining, it looks for rows where DRC1
matches and DX2
is NA
. If one is found, the corresponding V2
values is copied to column IDfound
.
Upvotes: 1