Reputation: 154
Situation 1
I get inconsistent results when applying the phrasetotoken function from the quanteda package:
dict <- dictionary(list(words = ......*lokale energie productie*......))
txt <- c("I like lokale energie producties")
phrasetotoken(txt, dict)
Problem: Sometimes I get lokale_energie_producties back, and sometimes, incorrectly, the original lokale energie producties.
The problem seems to be connected to the dots in the dictionary. As far as I can tell, these dots are needed to match leading and trailing characters (e.g., "1lokale energie productieniveau").
Situation 2
When I load the text from a txt file, the phrasetotoken function does not work at all.
txt <- paste(readLines("foo.txt"), collapse = " ")
txt <- phrasetotoken(txt, dict)
NB. Using the function readtext instead of readLines throws the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘phrasetotoken’ for signature ‘"readtext", "dictionary"’
Any help is appreciated.
Upvotes: 0
Views: 282
Reputation: 14902
We've replaced phrasetotoken() with a more powerful and flexible function, tokens_compound(). It works like this (after some modifications of your code to make it syntactically correct):
txt <- c("I like lokale energie producties")
toks <- tokens(txt)
tokens_compound(toks, list(words = c("*lokale", "energie", "productie*")))
## tokens from 1 document.
## Component 1 :
## [1] "I" "like" "lokale_energie_producties"
For text read from a file, try the following workflow instead:
library(quanteda)
library(readtext)
library(magrittr) # for the pipes
readtext("foo.txt") %>%
    corpus() %>%
    tokens() %>%
    tokens_compound(sequences = dict)
Upvotes: 0