ab0rt

Reputation: 69

Using quanteda's tokens_compound to join multi-word expressions with an underscore in a tokens object

I have a tokens object of words, with punctuation removed:

doc text
doc1 'Mohammed' 'Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2 'M' 'Fisher' 'likes' 'fishing' 'Fishing' 'yay'

I want to use tokens_compound on this to join certain multi-word expressions via underscore:

doc text
doc1 'Mohammed_Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2 'M_Fisher' 'likes' 'fishing' 'Fishing' 'yay'

Therefore, I defined a list of multi-word expressions I want to join and used tokens_compound:

multiword <- c('Mohammed Fisher', 'M Fisher')
comp_toks <- tokens_compound(tokens, pattern = phrase(multiword))

This does not work, and neither does

comp_toks <- tokens_compound(tokens, pattern = as.phrase(multiword))

nor

comp_toks <- tokens_compound(tokens, multiword)

What am I missing here?

Upvotes: 1

Views: 359

Answers (2)

Kohei Watanabe

Reputation: 890

Use phrase() instead of as.phrase().

> quanteda::phrase(c('Mohammed Fisher', 'M Fisher'))
[[1]]
[1] "Mohammed" "Fisher"  

[[2]]
[1] "M"      "Fisher"
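
For completeness, here is a minimal, self-contained sketch of the whole workflow. The input strings are a reconstruction of the tokens shown in the question (an assumption, since the original data was not posted), and the object names txt and toks are only illustrative.

```r
library(quanteda)

# Assumed input: the two documents from the question as plain strings
txt <- c(doc1 = "Mohammed Fisher is a great guy He loves fishing",
         doc2 = "M Fisher likes fishing Fishing yay")

# Build the tokens object (named toks here, not tokens, to avoid
# confusion with the tokens() function)
toks <- tokens(txt)

# phrase() splits each expression on whitespace into a token sequence
multiword <- c("Mohammed Fisher", "M Fisher")
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))

# Inspect the compounded tokens as a plain list of character vectors
as.list(comp_toks)
```

As far as I understand the documentation, a character vector passed without phrase() is matched against single tokens, whitespace included, which would explain why the third attempt in the question compounds nothing.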

Upvotes: 2

TarJae

Reputation: 79194

I am not very familiar with quanteda, but try this:

  1. Remove the ' characters from your text column.
  2. Define the tokens as toks <- ...
  3. Use tokens_compound.
  4. Check the result with kwic: https://quanteda.io/reference/kwic.html

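library(quanteda)
library(dplyr)
library(stringr)  # provides str_remove_all()

# Strip the single quotes from the text column
df1 <- df %>% 
    mutate(text = str_remove_all(text, "\\'"))

# Tokenize the cleaned text
toks <- tokens(df1$text)

# Join the multi-word expressions with an underscore, then look them up
toks_comp <- tokens_compound(toks, pattern = phrase(c("Mohammed Fisher*", "M Fisher*")))
kw_com <- kwic(toks_comp, pattern = c("Mohammed_Fisher*", "M_Fisher*"))
kw_com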

Keyword-in-context with 2 matches.                                                          
 [text1, 1]  | Mohammed_Fisher | is a great guy He        
 [text2, 1]  |    M_Fisher     | likes fishing Fishing yay

Upvotes: 1
