M.Viking

Reputation: 5398

Identify truncated strings and expand them to the longest string

I am working with dirty data in which many strings are truncated. I would like to create a new variable that holds the longest untruncated version of each string.

Example input:

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

Desired output:

y <- c("stackoverflow is a great site",
       "stackoverflow is a great site",
       "stackoverflow is a great site",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so sunny",
       "arggh how annoying")

After searching, the closest match I can find is this question and its answers: "Get unique string from a vector of similar strings".

The various answers in that thread can identify which strings are truncated and which are not. An example function:

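library(stringr)  # provides str_detect()

# Keep only the strings that are not contained in any other element of x
mystringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}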
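For reference, a minimal base R sketch of the same idea, assuming that a truncated string is always a prefix of its full version. The name `keep_untruncated` is hypothetical, and `startsWith()` compares literally, so regex metacharacters in the data are not interpreted:

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

# Hypothetical helper: keep strings that are not a prefix of any other element.
# Note: exact duplicates flag each other as truncated, so dedupe x first if needed.
keep_untruncated <- function(x) {
  is_truncated <- vapply(
    seq_along(x),
    function(i) any(startsWith(x[-i], x[i])),
    logical(1)
  )
  x[!is_truncated]
}

keep_untruncated(x)
```

On the example input this should leave the four complete strings, but it only flags which strings are complete; it does not yet map each truncated string to its full version.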

Upvotes: 1

Views: 115

Answers (2)

mgiormenti

Reputation: 833

Using your mystringr function:

library(data.table)
library(stringr)   # str_detect()
library(magrittr)  # %>%

# Given a single non-truncated string, get the original values which were truncated versions of it:
get_complete_str <- function(complete_str) {
  data.table(x) %>% 
    .[str_detect(complete_str, x)] %>% 
    .[, y := complete_str]
}

# Apply that function to every non-truncated string, and bind the results together:
lapply(mystringr(x), FUN = get_complete_str) %>% 
  rbindlist()

Upvotes: 1

d.b

Reputation: 32548

For each element of x, find every element of x that contains it (including itself) and return the longest of those matches.

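sapply(x, function(s){
    # fixed = TRUE matches s literally, so regex metacharacters in the data are safe
    temp = x[grepl(s, x, fixed = TRUE)]
    # Longest of the strings that contain s
    temp[which.max(nchar(temp))]
},
USE.NAMES = FALSE)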
#[1] "stackoverflow is a great site" "stackoverflow is a great site"
#[3] "stackoverflow is a great site" "stackoverflow is an OK site"  
#[5] "omg it is friday and so sunny" "omg it is friday and so sunny"
#[7] "arggh how annoying"  
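Since truncation only removes characters from the end, a stricter variant can require a prefix match instead of a match anywhere in the string. This is a minimal sketch under that assumption, in base R only; the function name `expand_truncated` is hypothetical:

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

# Hypothetical helper: map each string to the longest element of x that starts with it.
expand_truncated <- function(x) {
  vapply(x, function(s) {
    # s always starts with itself, so there is at least one candidate
    candidates <- x[startsWith(x, s)]
    # If several candidates share the maximum length, the first one is returned
    candidates[which.max(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

y <- expand_truncated(x)
```

Compared with `grepl()`, the prefix check avoids expanding a short string that merely appears in the middle of an unrelated longer one, and `vapply()` guarantees a character vector as the result.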

Upvotes: 3
