Identify truncated strings and expand them to the longest string

Question

I am working with dirty data in which many strings are truncated. I would like to create a new variable that holds, for each string, its longest untruncated version.

Example input:

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

Desired output:

y <- c("stackoverflow is a great site",
       "stackoverflow is a great site",
       "stackoverflow is a great site",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so sunny",
       "arggh how annoying")

After searching, the nearest thing I could find is this question and its answers: Get unique string from a vector of similar strings

The answers in that thread can identify which strings are truncated and which are not, but they do not map each truncated string to its full version. Example function:

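library(stringr)  # provides str_detect()
mystringr <- function(x){
  # keep only the strings that are not contained in any other element of x
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}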
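
To make the gap concrete, here is a minimal base-R sketch of the same filtering idea applied to the example input. The name keep_untruncated is hypothetical and the function is only an approximation of mystringr(): it uses grepl() with fixed = TRUE instead of str_detect(), so the strings are compared literally rather than as regular expressions.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

# Hypothetical base-R counterpart of mystringr():
# keep the strings that are not contained in any other element of x.
keep_untruncated <- function(x) {
  is_contained <- vapply(seq_along(x), function(i) {
    # literal (non-regex) substring test against all other elements
    any(grepl(x[i], x[-i], fixed = TRUE))
  }, logical(1))
  x[!is_contained]
}

full <- keep_untruncated(x)
```

On this input, full holds only the four complete strings, so it is shorter than x. What I need instead is a vector of the same length as x in which each truncated string is replaced by its complete version.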

d.b · Accepted Answer

For each element of x, find every element of x that contains it, then keep the longest of those matches.

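sapply(x, function(s){
    # all elements of x that contain s; fixed = TRUE matches s literally, not as a regex
    temp = x[grepl(s, x, fixed = TRUE)]
    # keep the longest of those matches
    temp[which.max(nchar(temp))]
},
USE.NAMES = FALSE)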
#[1] "stackoverflow is a great site" "stackoverflow is a great site"
#[3] "stackoverflow is a great site" "stackoverflow is an OK site"  
#[5] "omg it is friday and so sunny" "omg it is friday and so sunny"
#[7] "arggh how annoying"
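
Since truncation only cuts off the end of a string, a stricter variant could match on prefixes rather than on substrings anywhere in the string. The following is a sketch under that assumption, not part of the original answer. The name expand_truncated is hypothetical; startsWith() is base R (3.3.0 or later), and the code assumes x contains no NA values.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

# Hypothetical helper: replace each string by the longest element of x
# that begins with it (assumes no NA values in x).
expand_truncated <- function(x) {
  vapply(x, function(s) {
    # every string matches itself, so candidates is never empty
    candidates <- x[startsWith(x, s)]
    candidates[which.max(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

y <- expand_truncated(x)
```

On the example input this gives the same result as the sapply() call above. The two versions can differ when a short string occurs in the middle of a longer, unrelated one: the substring version would expand it and the prefix version would not. In both versions, if a truncated string is a prefix of several different complete strings, the longest one wins, and which.max() picks the first in case of a tie.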
