Reputation: 69
I have a tokens object, split into words and without punctuation:
doc | text |
---|---|
doc1 | 'Mohammed' 'Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing' |
doc2 | 'M' 'Fisher' 'likes' 'fishing' 'Fishing' 'yay' |
I want to use tokens_compound on this to join certain multi-word expressions with an underscore:
doc | text |
---|---|
doc1 | 'Mohammed_Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing' |
doc2 | 'M_Fisher' 'likes' 'fishing' 'Fishing' 'yay' |
Therefore, I defined a list of the multi-word expressions I want to join and used tokens_compound:
multiword <- c('Mohammed Fisher', 'M Fisher')
comp_toks <- tokens_compound(tokens, pattern = phrase(multiword))
This does not work, and neither does
comp_toks <- tokens_compound(tokens, pattern = as.phrase(multiword))
nor
comp_toks <- tokens_compound(tokens, multiword)
What am I missing here?
Upvotes: 1
Views: 359
Reputation: 890
Use phrase() instead of as.phrase().
> quanteda::phrase(c('Mohammed Fisher', 'M Fisher'))
[[1]]
[1] "Mohammed" "Fisher"
[[2]]
[1] "M" "Fisher"
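For completeness, here is a minimal sketch of how that pattern could be passed to tokens_compound. The texts in txt are an assumption reconstructed from the question's table, and the object is named toks rather than tokens only to avoid sharing a name with the tokens() function; the asker's real data may differ.

```r
library(quanteda)

# Assumed example texts, reconstructed from the table in the question.
txt <- c(doc1 = "Mohammed Fisher is a great guy He loves fishing",
         doc2 = "M Fisher likes fishing Fishing yay")
toks <- tokens(txt)

# phrase() splits each expression on whitespace, so every entry is
# matched as a sequence of tokens rather than as one single token.
multiword <- c("Mohammed Fisher", "M Fisher")
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))

# Inspect the compounded tokens per document.
as.list(comp_toks)
```

If this works on the toy texts but not on the real object, it may be worth checking what the tokens actually contain, for example whether quote characters are still attached to them.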
Upvotes: 2
Reputation: 79194
I am not quite familiar with quanteda.
Try this: remove ' from your text column, create the tokens with toks <- ..., then apply tokens_compound and kwic (https://quanteda.io/reference/kwic.html):
library(quanteda)
library(dplyr)
library(stringr)  # provides str_remove_all()
df1 <- df %>%
mutate(text = str_remove_all(text, "\\'"))
toks <- tokens(df1$text)
toks_comp <- tokens_compound(toks, pattern = phrase(c("Mohammed Fisher*", "M Fisher*")))
kw_com <- kwic(toks_comp, pattern= c("Mohammed_Fisher*", "M_Fisher*"))
kw_com
Keyword-in-context with 2 matches.
[text1, 1] | Mohammed_Fisher | is a great guy He
[text2, 1] | M_Fisher | likes fishing Fishing yay
Upvotes: 1