Quanteda phrasetotoken does not work

Question

Situation 1

I get strange results when applying the phrasetotoken function in the quanteda package:

dict    <- dictionary(list(words = ......*lokale energie productie*......)) 
txt     <- c("I like lokale energie producties")
phrasetotoken(txt, dict)

Problem: Sometimes I get lokale_energie_producties back, and sometimes, incorrectly, the original lokale energie producties.

The problem seems connected to the dots in the dictionary. These dots are(?) needed to deal with starting and trailing characters (e.g., "1lokale energie productieniveau").

Situation 2

When loading a txt file, the phrasetotoken function does not work at all.

txt <- paste(readLines("foo.txt"), collapse=" ")
txt <- phrasetotoken(txt, dict)

NB. Using the function readtext instead of readLines throws the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘phrasetotoken’ for signature ‘"readtext", "dictionary"’

Any help is appreciated.

Ken Benoit · Accepted Answer

Situation 1

We've replaced phrasetotoken() with a more powerful and flexible function tokens_compound(). It works like this (after some modifications of your code to make it syntactically correct):

txt <- c("I like lokale energie producties") 
toks <- tokens(txt)

tokens_compound(toks, list(words = c("*lokale", "energie",  "productie*")))
## tokens from 1 document.
## Component 1 :
## [1] "I"                         "like"                      "lokale_energie_producties"
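
For readers on a more recent quanteda release, here is a minimal sketch of the same compounding step. It assumes a version in which tokens_compound() takes a `pattern` argument and phrase() is available to mark a multi-word pattern; neither appears in the code above, so check them against your installed version. The glob wildcards play the role that the dots in the original dictionary were meant to play.

```r
library(quanteda)

txt  <- c("I like lokale energie producties")
toks <- tokens(txt)

# phrase() splits the string on whitespace, so the three words are matched
# as one sequence. "*" is a glob wildcard (glob is the default valuetype):
# "*lokale" also allows leading characters, "productie*" trailing ones.
compounded <- tokens_compound(
  toks,
  pattern      = phrase("*lokale energie productie*"),
  concatenator = "_"
)

# Inspect the result as a plain list of character vectors
as.list(compounded)
```

If the sequence matches, the three words should come back joined as a single token, lokale_energie_producties.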

Situation 2

Try the following workflow instead:

require(magrittr)  # for the pipes
readtext("foo.txt") %>%
    corpus() %>%
    tokens() %>%
    tokens_compound(sequences = dict)
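
If you would rather stay with base R's readLines(), the following is a rough sketch of that route. It writes a small temporary file in place of foo.txt (the file contents are made up for illustration) and assumes a quanteda version where tokens_compound() accepts `pattern` and phrase() exists. Note that `collapse` is an argument of paste(), not of readLines().

```r
library(quanteda)

# Hypothetical stand-in for foo.txt: the phrase is split across two lines
path <- tempfile(fileext = ".txt")
writeLines(c("I like lokale", "energie producties"), path)

# readLines() returns one element per line; paste(..., collapse = " ")
# joins them into a single string so the phrase is contiguous again
txt <- paste(readLines(path), collapse = " ")

toks <- tokens(txt)
compounded <- tokens_compound(
  toks,
  pattern = phrase("lokale energie productie*")
)

as.list(compounded)
```

Collapsing the lines first matters here: without it, each line becomes its own document and a phrase that spans a line break cannot be matched.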
