maxheld
maxheld

Reputation: 4273

Shortest unique substrings of a vector

Say I have a character vector of some length,

vec <- c("man lives to work", "man works to live")

Starting from the beginning, I'd now like to find the shortest unique substrings (of complete words) in this vector. In other words, I'm not looking for the shortest substring overall, but I'd like crop the string after the word where it becomes unique, in this case after work and lives, respectively.

So the result should be, in this case:

[1] "man lives" "man works"

The strings should be cropped after lives/works because that's the earliest point at which they become unique (in this context). Including to would be redundant, because they are already unique. Including only man would not be enough, because c("man", "man") is not unique.

(I want to use this to automatically generate valid R names, and base::make.names() will do the rest).

How do I do this?

I figured, there has to be a package out there which already does this, but couldn't find it.

Upvotes: 1

Views: 433

Answers (2)

Ista
Ista

Reputation: 10437

As a general strategy I would a) check to see if the first word is unique, b) if not, check if the first two words are unique, c) continue until unique solutions have been found for each string.

You could implement this usit a while loop, or using recursion. Here is an example of the later (UPDATED to preserve order):

library(stringi) ## makes string processing easier

vec <- c("man lives to work", "man works to live")

(word.mat <- stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
##      [,1]  [,2]    [,3] [,4]  
## [1,] "man" "lives" "to" "work"
## [2,] "man" "works" "to" "live"

## function to extract unique words
unique_words <- function(x, # matrix of words
                         n = nrow(x), # number of original strings
                         nc=1 # number of columns (words) to use
                         ) {
    ## join the first nc words
    s <- stri_trim(apply(x[, 1:nc, drop = FALSE], 1, stri_join, collapse = " "))
    ## find non-duplicated word combinations, and store in column 1
    nodups <- !s %in% s[stri_duplicated(s)]
    x[nodups, 1] <- s[nodups]
    ## remove extra words from the matrix
    x[nodups, -1] <- ""
    ## if some strings are not unique, do it again, increasing nc by one
    if(any(x[, 2] != "")) {
        x <- unique_words(x = x, n = n, nc = nc + 1)
    ## otherwise, grab the unique sub-phrases from column 1    
    } else {
        x <- x[, 1]
    }
    ## return the result
    x
}    
## test it out
unique_words(word.mat)
## [1] "man lives" "man works"

## test it out with a more complicated example:
vec <- c("foo", "man lives to eat", "man eats to live",
         "woman lives to work", "woman works to live",
         "we like apples", "we like peaches",
         "they like plums", "they love peas", "bar")
unique_words(stri_split_boundaries(vec,
                                   type = "word",
                                   skip_word_none = TRUE,
                                   simplify = TRUE))
## [1] "foo"             "man lives"       "man eats"        "woman lives"    
## [5] "woman works"     "we like apples"  "we like peaches" "they like"      
## [9] "they love"       "bar"

Upvotes: 2

Mouad_Seridi
Mouad_Seridi

Reputation: 2716

df %>%  unnest_tokens(word ,words) %>%
  mutate(bigram = substr(word,1,2), 
         trigram = ifelse (nchar(word) >= 3,substr(word,1,3),NA) ,
         four_gram  = ifelse (nchar(word) >= 4, substr(word,1,4), NA), 
         five_gram  = ifelse (nchar(word) >= 5, substr(word,1,5), NA)) %>%
  group_by(bigram) %>%
  mutate(count_bigram = n()) %>%
  ungroup() %>%
  group_by(trigram) %>%
  mutate(count_trigram = n()) %>%
  ungroup() %>%
  group_by(four_gram) %>%
  mutate(count_four_gram = n()) %>%
  ungroup() %>%
  group_by(five_gram) %>%
  mutate(count_five_gram = n()) %>%
  ungroup()   %>% 
  summarise_each(funs(((function(x) {sum(x == 1)})(.))), 
                 count_bigram, count_trigram, 
                 count_four_gram, count_five_gram)



# # A tibble: 1 × 4
#    count_bigram count_trigram count_four_gram count_five_gram
#          <int>         <int>           <int>           <int>
#1            0             0               0               2

Upvotes: 1

Related Questions