Tavi

Reputation: 2748

Get unique string from a vector of similar strings

I don't quite know how to phrase the question. I have just started working on a bunch of tweets. I've done some basic cleaning, and now some of the tweets look like this:

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

Basically, I want to remove the repetitions by checking whether the beginnings of the strings match and keeping the longest one of each group. In this case my result should be:

[1]"stackoverflow is a great site"
[2]"omg it is friday and so sunny"
[3]"arggh how annoying"

because all the others are truncated repetitions of the above. I've tried using the unique() function, but it doesn't return the results I want because it compares the strings over their whole length. Any pointers please?

I'm using R version 3.1.1 on Mac OSX 10.7...

Thanks!

Upvotes: 4

Views: 460

Answers (3)

bartektartanus

Reputation: 16080

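@tonytonov's solution is good, but I recommend using the stringi package :) Note that stri_detect_fixed matches the pattern literally, whereas str_detect treats it as a regular expression.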

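library(stringi)
library(stringr)
library(microbenchmark)

# x is the character vector from the question.
# Both functions drop every string that occurs inside another element of x.

# stringi version: literal (fixed) substring matching
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}

# stringr version: the pattern is interpreted as a regular expression
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}

microbenchmark(stringi(x), stringr(x))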
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100
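
For readers without either package, the same containment check can be sketched in base R. This is only an illustration under stated assumptions: contained_base is a hypothetical name that does not appear in the answer, and grepl is called with fixed = TRUE so that characters such as ? or ( in a tweet are matched literally rather than as a regular expression. No timings are claimed for it.

```r
# Hypothetical base R helper (not from the answer above):
# drop every string that occurs literally inside another element of x.
contained_base <- function(x) {
  x <- unique(x)  # exact duplicates would otherwise eliminate each other
  is_contained <- vapply(seq_along(x), function(i) {
    # fixed = TRUE: treat x[i] as a plain substring, not a regex
    any(grepl(x[i], x[-i], fixed = TRUE))
  }, logical(1))
  x[!is_contained]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

contained_base(x)
```

On the sample vector from the question this should keep the three full-length tweets.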

Upvotes: 1

Matthew Plourde

Reputation: 44614

This is another option. I've added one string to your sample data.

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

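# Keep y only if it is not a prefix of any other (distinct) string in x
Filter(function(y) {
    # truncate every other string to the length of y
    x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
    # the first element is flagged as duplicated if y equals one of those prefixes
    ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)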


# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"

Upvotes: 2

tonytonov

Reputation: 25638

Here's my attempt:

library(stringr)
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying" 

Basically, I exclude every string that is contained in any of the others. This is a bit different from what you describe, since a string is matched anywhere inside another one rather than only at its start, but it gives approximately the same result and is quite simple.
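
If the match really has to be at the start of the string, as the question describes, a prefix-only variant could look like the sketch below. It is an illustration rather than part of this answer: drop_truncated is a hypothetical name, and substr is used instead of startsWith on the assumption that the code should also run on older R versions such as the 3.1.1 mentioned in the question.

```r
# Hypothetical helper (not from the answer above):
# drop every string that is a prefix of another element of x.
drop_truncated <- function(x) {
  x <- unique(x)  # exact duplicates would otherwise eliminate each other
  is_prefix <- vapply(seq_along(x), function(i) {
    # cut the other strings to the length of x[i] and compare literally
    any(substr(x[-i], 1, nchar(x[i])) == x[i])
  }, logical(1))
  x[!is_prefix]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

drop_truncated(x)
# [1] "stackoverflow is a great site" "omg it is friday and so sunny"
# [3] "arggh how annoying"
```

Unlike the containment check, this keeps a tweet such as "a great site", which appears inside another tweet but not at its beginning.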

Upvotes: 1
