Reputation: 13
I am currently working on a text mining project, and after running my ngrams model I realized I have sequences of repeated words. I would like to remove the repeated words while keeping their first occurrence. An illustration of what I intend to do is shown in the code below. Thanks!
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
library("quanteda")
textfun <- corpus(textfun)
textfuntoks <- tokens(textfun)
textfunRef <- tokens_replace(textfuntoks, pattern = **?**, replacement = **?**, valuetype ="regex")
The desired result is "This analysis should remove all of the duplicated or repeated words and return only their first occurrence". I am only interested in consecutive repetitions.
My main problem is coming up with values for the "pattern" and "replacement" arguments of the tokens_replace function. I have tried different patterns, some of them adapted from sources on here, but none seems to work. An image of the problem is included. [5-grams frequency distribution showing repeated-word instances for words like "swag", "pleas", "gas", "books", "chicago", and "happi"]
Upvotes: 0
Views: 302
Reputation: 14902
Interesting challenge. To do this within quanteda, you can create a dictionary mapping each repeated sequence to its single occurrence.
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus("This this this this will analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence")
toks <- tokens(corp)
ngrams <- tokens_tolower(toks) %>%
    tokens_ngrams(n = 5:2, concatenator = " ") %>%
    as.character()
# choose only the ngrams that are all the same word
ngrams <- ngrams[lengths(sapply(strsplit(ngrams, split = " "), unique, simplify = TRUE)) == 1]
# remove duplicates
ngrams <- unique(ngrams)
head(ngrams, n = 3)
## [1] "all all all all all" "return return return return return"
## [3] "this this this this"
So this provides a vector of all (lowercased) repeated values. (To avoid lowercasing, remove the tokens_tolower() line.)
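As an aside, the same filtering step could also be written with a single backreference regular expression instead of the strsplit()/unique() test above; a minimal standalone sketch (the ngrams_demo vector here is only illustrative):

# keep only ngrams consisting of one word repeated, via a backreference regex
# (illustrative input; in the answer above this would be the ngrams vector)
ngrams_demo <- c("all all all all all", "of the", "this this this this", "first occurrence")
grep("^(\\S+)( \\1)+$", ngrams_demo, value = TRUE, perl = TRUE)
## [1] "all all all all all" "this this this this"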
Now we create a dictionary where each sequence is a "value", and each unique token is the "key". Multiple identical keys will exist in the list from which dict is built, but the dictionary() constructor automatically combines them. Once this is created, the sequences can be converted to the single token using tokens_lookup().
dict <- dictionary(
    structure(
        # this causes each ngram to be treated as a single "value"
        as.list(ngrams),
        # each dictionary key will be the unique token
        names = sapply(ngrams, function(x) strsplit(x, split = " ")[[1]][1], simplify = TRUE, USE.NAMES = FALSE)
    )
)
# convert the sequences to their keys
toks2 <- tokens_lookup(toks, dict, exclusive = FALSE, nested_scope = "dictionary", capkeys = FALSE)
print(toks2, max_ntoken = -1)
## Tokens consisting of 1 document.
## text1 :
## [1] "this" "will" "analysis" "should" "remove"
## [6] "all" "of" "the" "duplicated" "or"
## [11] "repeated" "words" "and" "return" "only"
## [16] "their" "first" "occurrence"
Created on 2021-04-08 by the reprex package (v1.0.0)
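If a single de-duplicated string is wanted rather than a tokens object, one option (a small follow-on sketch, reusing the as.character() conversion shown earlier) is to paste the tokens back together:

# collapse the de-duplicated tokens back into a single string
paste(as.character(toks2), collapse = " ")
## [1] "this will analysis should remove all of the duplicated or repeated words and return only their first occurrence"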
Upvotes: 1
Reputation: 389225
You can split the data at each word, use rle to find runs of consecutive occurrences, and paste the first value of each run back together.
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
paste0(rle(tolower(strsplit(textfun, '\\s+')[[1]]))$values, collapse = ' ')
#[1] "this analysis should remove all of the duplicated or repeated words and return only their first occurrence"
Upvotes: 1