Reputation: 4180
Suppose I have a data frame like the one below:
data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33",
"22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb",
"33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data
# id
#1 11_22
#2 11_22_bb
#3 11_22_bb_33
#4 22_bb_33
#5 bb_33
#6 ab_cc_sd
#7 dd_e3_bb
#8 33_34_sd_22_32_87_cc
The strings in rows 1-2 & 4-5 are contained in the string in row 3. My goal is to remove rows 1-2 and rows 4-5 and retain only strings in rows that are not contained within other strings.
(1) The simplest solution is comparing each row against all other rows, and flagging it if it is found to be contained in another one(s). For example:
data$flag <- TRUE
for( i in 1:nrow( data ) ){
if( sum(grepl( data[i, 1], data[-i,1] )) > 0 )
data$flag[ i ] <- FALSE
}
data <- data[data$flag, ]
# id flag
#3 11_22_bb_33 TRUE
#6 ab_cc_sd TRUE
#7 dd_e3_bb TRUE
#8 33_34_sd_22_32_87_cc TRUE
But this is not efficient, especially when dealing with longer data frames.
(2) An improved solution seems to be to:
11_22_bb
is found to be contained within 11_22_bb_33
, 11_22_bb
is flagged, and when 11_22
is being compared, it will only be compared to 11_22_bb_33
and not 11_22_bb
.)As shown below:
data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33",
"22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb",
"33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data$len_char <- nchar(data$id)
data <- data[ order(data$len_char, decreasing=TRUE),]
data$flag <- TRUE
for( i in 2:nrow(data)){
if( sum(grepl( data[i, "id"],
data[ data$len_char > data$len_char[i] & data$flag, "id"])) > 0
)
data[i, "flag"] <- FALSE
}
data <- data[data$flag, ]
data
# id len_char flag
#8 33_34_sd_22_32_87_cc 20 TRUE
#3 11_22_bb_33 11 TRUE
#6 ab_cc_sd 8 TRUE
#7 dd_e3_bb 8 TRUE
My question: is there a way to make this even more efficient. At this point the 2nd method takes about 16 seconds to reduce a 6700 row data frame (the final, resulting data frame is 1400 rows). The first method takes about 50 seconds.
Upvotes: 0
Views: 53
Reputation: 17369
And vapply
wins the day.
data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33",
"22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb",
"33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data$len_char <- nchar(data$id)
data <- data[ order(data$len_char, decreasing=TRUE),]
data$flag <- TRUE
samp <- sample(1:nrow(data), 6700, replace = TRUE)
Strings <- Strings_orig <- data[samp, , drop=FALSE]
system.time({
for( i in 2:nrow(Strings)){
if( sum(grepl( Strings[i, "id"],
Strings[ Strings$len_char > Strings$len_char[i] & Strings$flag, "id"])) > 0
)
Strings[i, "flag"] <- FALSE
}
})
user system elapsed
3.81 0.00 3.81
Strings <- Strings_orig
system.time({
for (i in 2:nrow(Strings)){
Strings$flag[i] <- !any(grepl( Strings[i, "id"],
Strings[ Strings$len_char > Strings$len_char[i] & Strings$flag, "id"]))
}
})
user system elapsed
3.79 0.00 3.79
Strings <- Strings_orig
fn <- function(id, len_char){
any(grepl(id, Strings$id[Strings$len_char > len_char & Strings$flag]))
}
system.time({
vapply(Strings$flag, fn, TRUE, len_char = Strings$len_char)
})
user system elapsed
1.03 0.00 1.03
Upvotes: 1