Using quanteda's tokens_compound to join multi-word expressions with underscores in a tokens object

Question

I have a tokens object made up of word tokens, with punctuation removed:

doc	text
doc1	'Mohammed' 'Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2	'M' 'Fisher' 'likes' 'fishing' 'Fishing' 'yay'

I want to use tokens_compound on this to join certain multi-word expressions via underscore:

doc	text
doc1	'Mohammed_Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2	'M_Fisher' 'likes' 'fishing' 'Fishing' 'yay'

To do this, I defined a character vector of the multi-word expressions I want to join and passed it to tokens_compound:

multiword <- c('Mohammed Fisher', 'M Fisher')
comp_toks <- tokens_compound(tokens, pattern = phrase(multiword))

This does not work, and neither does

comp_toks <- tokens_compound(tokens, pattern = as.phrase(multiword))

nor

comp_toks <- tokens_compound(tokens, multiword)

What am I missing here?

Kohei Watanabe · Accepted Answer

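Use phrase() rather than as.phrase(). phrase() splits each character string on whitespace, so every multi-word expression becomes a sequence of tokens that tokens_compound can match: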

> quanteda::phrase(c('Mohammed Fisher', 'M Fisher'))
[[1]]
[1] "Mohammed" "Fisher"  

[[2]]
[1] "M"      "Fisher"
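
For completeness, here is a minimal end-to-end sketch of how this might look. It assumes the two documents from the question are rebuilt from plain strings; the names txt and toks are placeholders chosen for illustration (toks also avoids reusing the name of the tokens() function), and concatenator = "_" is spelled out only to make the joining character explicit.

```r
library(quanteda)

# Example texts rebuilt from the question's table (assumed input, for illustration)
txt <- c(doc1 = "Mohammed Fisher is a great guy He loves fishing",
         doc2 = "M Fisher likes fishing Fishing yay")
toks <- tokens(txt)

multiword <- c("Mohammed Fisher", "M Fisher")

# phrase() turns each string into a sequence of tokens, so the pattern
# matches adjacent tokens rather than one token containing a space
comp_toks <- tokens_compound(toks, pattern = phrase(multiword),
                             concatenator = "_")

print(comp_toks)
```

Passing multiword without phrase() treats each element as a pattern for a single token, and no single token contains a space, so nothing would be compounded in that case.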
