Reputation: 2748
I don't quite know how to phrase the question. I have just started working on a bunch of tweets. I've done some basic cleaning, and now some of the tweets look like this:
x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")
Basically I want to remove the repetitions by checking if the first parts of the strings match and return the longest of them. In this case my result should be:
[1]"stackoverflow is a great site"
[2]"omg it is friday and so sunny"
[3]"arggh how annoying"
because all the others are truncated repetitions of the above. I've tried using the unique() function, but it doesn't return the results I want because it tries to match the whole length of the strings. Any pointers please?
I'm using R version 3.1.1 on Mac OSX 10.7...
Thanks!
Upvotes: 4
Views: 460
Reputation: 16080
@tonytonov's solution is good, but I recommend using the stringi package :)
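library(stringi)
library(stringr)
library(microbenchmark)
# Drop every string that occurs inside another one (fixed, non-regex match).
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}
# The same filter using stringr, for comparison.
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}
microbenchmark(stringi(x), stringr(x))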
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100
Upvotes: 1
Reputation: 44614
This is another option. I've added one string to your sample data.
x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")
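Filter(function(y) {
  # Cut every other string down to the length of y.
  x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
  # Keep y only if none of those truncated strings equals y.
  ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)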
# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"
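To make the test inside Filter() easier to follow, here is a sketch of a single iteration, run by hand for one truncated tweet. The names others and is_truncated are only illustrative, and substr() is called directly because it is already vectorised.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

y <- "stackoverflow is a great"        # one of the truncated tweets
others <- setdiff(x, y)                # every other distinct string
x2 <- substr(others, 1, nchar(y))      # each one cut to the length of y

# With fromLast = TRUE, the first element is flagged as a duplicate
# whenever an identical value appears later in the vector, i.e. whenever
# some other string starts with y.
is_truncated <- duplicated(c(y, x2), fromLast = TRUE)[1]
is_truncated
# [1] TRUE
```

Since the result is TRUE, Filter() drops this string. For the full tweet "stackoverflow is a great site" no other string matches after truncation, so the same test gives FALSE and the string is kept.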
Upvotes: 2
Reputation: 25638
Here's my attempt:
library(stringr)
# fixed() makes x[i] match literally; tweets often contain regex metacharacters.
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], fixed(x[i]))))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying"
Basically, I exclude every string that is contained in any of the others. This is a bit different from what you describe, because it matches a substring anywhere in the other string rather than only at the start, but it does approximately the same thing and is quite simple.
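If matching only at the start of the string matters, a base R variant along these lines might do. It is a minimal sketch, not a tested solution for real tweet data: drop_truncated is a made-up name, and the unique() call is an added assumption so that exact duplicates do not eliminate each other.

```r
# Keep only the strings that are not the leading part of another string.
drop_truncated <- function(x) {
  x <- unique(x)  # assumption: exact duplicates should collapse to one copy
  is_prefix <- vapply(seq_along(x), function(i) {
    # Cut the other strings to the length of x[i] and compare literally.
    any(substr(x[-i], 1, nchar(x[i])) == x[i])
  }, logical(1))
  x[!is_prefix]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

drop_truncated(x)
# [1] "stackoverflow is a great site" "omg it is friday and so sunny"
# [3] "arggh how annoying"
```

Unlike the substring version, this keeps a short string such as "great" when it only appears in the middle of another string such as "a great day".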
Upvotes: 1