Reputation: 5398
I'm working with dirty data in which many strings are truncated. I would like to create a new variable that holds, for each string, its longest un-truncated version.
Example input:
x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so",
"omg it is friday and so sunny",
"arggh how annoying")
Desired output:
y <- c("stackoverflow is a great site",
"stackoverflow is a great site",
"stackoverflow is a great site",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so sunny",
"arggh how annoying")
After searching, the closest thing I could find is this question/answer: Get unique string from a vector of similar strings
The various answers in that thread can identify which strings are truncated and which are not. An example function:
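library(stringr)  # provides str_detect()
# Keep only the strings that are not contained in any other element of x
mystringr <- function(x){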
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}
Upvotes: 1
Views: 115
Reputation: 833
Using your mystringr function:
library(data.table)
library(stringr)   # str_detect()
library(magrittr)  # %>%
# Given a single non-truncated string, get the original values that were truncated versions of it:
get_complete_str <- function(complete_str) {
data.table(x) %>%
.[str_detect(complete_str, x)] %>%
.[, y := complete_str]
}
# Apply that function to every possible non-truncated string, and bind the result together:
lapply(mystringr(x), FUN = get_complete_str) %>%
rbindlist()
Upvotes: 1
Reputation: 32548
For each element of x, find every element of x that contains it (including itself) and keep the longest one.
sapply(x, function(s){
temp = x[grepl(s, x)]
temp[which.max(nchar(temp))]
},
USE.NAMES = FALSE)
#[1] "stackoverflow is a great site" "stackoverflow is a great site"
#[3] "stackoverflow is a great site" "stackoverflow is an OK site"
#[5] "omg it is friday and so sunny" "omg it is friday and so sunny"
#[7] "arggh how annoying"
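Note that grepl(s, x) treats each string as a regular expression, so strings containing characters such as "(" or "?" may not match as intended. As a hedged alternative, here is a minimal sketch that assumes truncation only ever cuts off the end of a string, so every truncated value is a prefix of its full version. The helper name longest_completion is hypothetical (not from this thread), and the sketch assumes x contains no NA values.

```r
# Hypothetical helper: treat truncation as a prefix relation (assumption).
# Uses base R only; startsWith() compares literally, with no regex interpretation.
longest_completion <- function(x) {
  vapply(x, function(s) {
    # Candidates are all strings that start with s (s itself always qualifies)
    candidates <- x[startsWith(x, s)]
    # Keep the longest candidate; ties resolve to the first one found
    candidates[which.max(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so",
       "omg it is friday and so sunny",
       "arggh how annoying")

result <- longest_completion(x)
```

If a truncated string could also appear in the middle of a longer one, keeping grepl() but passing fixed = TRUE is another option for literal matching.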
Upvotes: 3