Alex
Alex

Reputation: 4180

Efficiently remove strings that are contained within other strings in a column in R

Suppose I have a data frame like the one below:

data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33", 
                      "22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb", 
                      "33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data

#                    id
#1                11_22
#2             11_22_bb
#3          11_22_bb_33
#4             22_bb_33
#5                bb_33
#6             ab_cc_sd
#7             dd_e3_bb
#8 33_34_sd_22_32_87_cc

The strings in rows 1-2 & 4-5 are contained in the string in row 3. My goal is to remove rows 1-2 and rows 4-5 and retain only strings in rows that are not contained within other strings.

(1) The simplest solution is comparing each row against all other rows, and flagging it if it is found to be contained in another one(s). For example:

data$flag <- TRUE
for( i in 1:nrow( data ) ){
  if( sum(grepl( data[i, 1], data[-i,1] )) > 0 )
      data$flag[ i ] <- FALSE
}
data <- data[data$flag, ]

#                    id flag
#3          11_22_bb_33 TRUE
#6             ab_cc_sd TRUE
#7             dd_e3_bb TRUE
#8 33_34_sd_22_32_87_cc TRUE

But this is not efficient, especially when dealing with longer data frames.

(2) An improved solution seems to be to:

  1. Sort the strings in descending length order;
  2. Compare each string (that is not the longest) to strings that are longer.
  3. If a string is found to be contained within a longer string, this shorter string is flagged and will not be compared against when an even shorter string is being compared. (E.g., when 11_22_bb is found to be contained within 11_22_bb_33, 11_22_bb is flagged, and when 11_22 is being compared, it will only be compared to 11_22_bb_33 and not 11_22_bb.)

As shown below:

data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33", 
                       "22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb", 
                       "33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data$len_char <- nchar(data$id)
data <- data[ order(data$len_char, decreasing=TRUE),]
data$flag <- TRUE

for( i in 2:nrow(data)){
    if( sum(grepl( data[i, "id"],
        data[ data$len_char > data$len_char[i] & data$flag, "id"])) > 0
       )
        data[i, "flag"] <- FALSE
}
data <- data[data$flag, ]
data

#                    id len_char flag
#8 33_34_sd_22_32_87_cc       20 TRUE
#3          11_22_bb_33       11 TRUE
#6             ab_cc_sd        8 TRUE
#7             dd_e3_bb        8 TRUE

My question: is there a way to make this even more efficient. At this point the 2nd method takes about 16 seconds to reduce a 6700 row data frame (the final, resulting data frame is 1400 rows). The first method takes about 50 seconds.

Upvotes: 0

Views: 53

Answers (1)

Benjamin
Benjamin

Reputation: 17369

And vapply wins the day.

data<-data.frame(id=c("11_22", "11_22_bb", "11_22_bb_33", 
                       "22_bb_33", "bb_33", "ab_cc_sd", "dd_e3_bb", 
                       "33_34_sd_22_32_87_cc"), stringsAsFactors=FALSE)
data$len_char <- nchar(data$id)
data <- data[ order(data$len_char, decreasing=TRUE),]
data$flag <- TRUE

samp <- sample(1:nrow(data), 6700, replace = TRUE)

Strings <- Strings_orig <- data[samp, , drop=FALSE]

system.time({
  for( i in 2:nrow(Strings)){
    if( sum(grepl( Strings[i, "id"],
        Strings[ Strings$len_char > Strings$len_char[i] & Strings$flag, "id"])) > 0
       )
        Strings[i, "flag"] <- FALSE
}
})

 user  system elapsed  
 3.81    0.00    3.81 


Strings <- Strings_orig

system.time({
for (i in 2:nrow(Strings)){
  Strings$flag[i] <- !any(grepl( Strings[i, "id"],
        Strings[ Strings$len_char > Strings$len_char[i] & Strings$flag, "id"]))
}
})

  user  system elapsed 
  3.79    0.00    3.79 



Strings <- Strings_orig
fn <- function(id, len_char){
  any(grepl(id, Strings$id[Strings$len_char > len_char & Strings$flag]))
}
system.time({
  vapply(Strings$flag, fn, TRUE, len_char = Strings$len_char)
})
  user  system elapsed 
  1.03    0.00    1.03 

Upvotes: 1

Related Questions