Pxu80

Reputation: 154

Quanteda phrasetotoken does not work

Situation 1

I get inconsistent results when applying the phrasetotoken function from the quanteda package:

dict    <- dictionary(list(words = ......*lokale energie productie*......)) 
txt     <- c("I like lokale energie producties) 
phrasetotoken(txt, dict)

Problem: Sometimes I get lokale_energie_producties back; at other times I incorrectly get the original, uncompounded lokale energie producties.

The problem seems to be connected to the dots in the dictionary. I believe these dots are needed to deal with leading and trailing characters (e.g., "1lokale energie productieniveau").

Situation 2

When the text is loaded from a txt file, the phrasetotoken function does not work at all.

txt <- paste(readLines("foo.txt", collapse=" ")
txt <- phrasetotoken(txt, dict)

NB: Using the function readtext instead of readLines throws the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘phrasetotoken’ for signature ‘"readtext", "dictionary"’

Any help is appreciated.

Upvotes: 0

Views: 282

Answers (1)

Ken Benoit

Reputation: 14902

Situation 1

We've replaced phrasetotoken() with a more powerful and flexible function tokens_compound(). It works like this (after some modifications of your code to make it syntactically correct):

txt <- c("I like lokale energie producties") 
toks <- tokens(txt)

tokens_compound(toks, list(words = c("*lokale", "energie",  "productie*")))
## tokens from 1 document.
## Component 1 :
## [1] "I"                         "like"                      "lokale_energie_producties"
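For readers on a more recent quanteda, the same compounding can be sketched with a single phrase pattern. This assumes quanteda 1.0 or later, where the second argument of tokens_compound() is named pattern and phrase() is available to mark a whitespace-separated string as a multi-word sequence; the variable name compounded is only illustrative.

```r
library(quanteda)

txt  <- c("I like lokale energie producties")
toks <- tokens(txt)

# phrase() splits the string on whitespace into a three-token sequence.
# Matching uses glob wildcards by default, so "*lokale" and "productie*"
# also accept extra leading and trailing characters.
compounded <- tokens_compound(toks, pattern = phrase("*lokale energie productie*"))

# Inspect the result as a plain list of character vectors
as.list(compounded)
```

Glob patterns use * rather than regular-expression dots, which is probably why a dictionary entry padded with dots matches unpredictably.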

Situation 2

Try the following workflow instead:

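library(readtext)  # provides readtext()
library(quanteda)  # provides corpus(), tokens(), tokens_compound()
require(magrittr)  # for the pipes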
readtext("foo.txt") %>%
    corpus() %>%
    tokens() %>%
    tokens_compound(sequences = dict) 
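If you would rather stay with readLines(), note that collapse is an argument of paste(), not of readLines(). A minimal sketch follows, under some assumptions: a temporary file stands in for foo.txt, and the code targets quanteda 1.0 or later, where the argument is named pattern and phrase() exists. The names path and result are hypothetical.

```r
library(quanteda)

# Hypothetical stand-in for foo.txt so the sketch is self-contained;
# the phrase is deliberately split across two lines of the file.
path <- tempfile(fileext = ".txt")
writeLines(c("I like lokale", "energie producties"), path)

# readLines() returns one element per line; paste(..., collapse = " ")
# joins them into a single string, so the phrase becomes contiguous.
txt <- paste(readLines(path), collapse = " ")

# Tokenize first, then compound the multi-word pattern
result <- tokens_compound(tokens(txt), pattern = phrase("*lokale energie productie*"))
as.list(result)
```

Collapsing the lines first matters when a phrase spans a line break, because each element of a character vector is otherwise treated as a separate document.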

Upvotes: 0
