Reputation: 67
I am using a dictionary to identify usage of a particular set of words in a corpus. I have included multi-word patterns in the dictionary; however, I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary containing multi-word expressions?
library(quanteda)
BritainEN <-
dictionary(list(identity=c("British", "Great Britain")))
# debate_dfm is the dfm built from the corpus (not shown here)
British <- dfm_lookup(debate_dfm,
BritainEN, case_insensitive = TRUE)
Upvotes: 2
Views: 889
Reputation: 14902
Yes - you need to use tokens_lookup() on the tokens before you form the dfm. Once the tokens have been counted into a dfm, they no longer exist as the ordered sequence you need to match the multi-word values in your dictionary. So 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.
library("quanteda")
#> Package version: 1.5.2
BritainEN <-
dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
doc2 = "British citizens live in Great Britain.")
tokens(txt) %>%
tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is" "a" "country" "."
#>
#> doc2 :
#> [1] "IDENTITY" "citizens" "live" "in" "IDENTITY" "."
tokens(txt) %>%
tokens_lookup(dictionary = BritainEN) %>%
dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#> features
#> docs identity
#> doc1 1
#> doc2 2
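To see why the order of the steps matters, here is a minimal sketch that runs both routes on the same two texts and puts the counts side by side. The object names counts_dfm and counts_toks are only illustrative. The expectation, given that dfm features are single words, is that the dfm_lookup() column misses the "Great Britain" matches while the tokens_lookup() column reproduces the counts shown above.

```r
library("quanteda")

# Same dictionary and texts as in the example above
BritainEN <- dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

toks <- tokens(txt)

# Route 1: lookup on the dfm. Features are single words with no order,
# so the two-word value "Great Britain" has nothing to match against.
counts_dfm <- dfm_lookup(dfm(toks), dictionary = BritainEN)

# Route 2: lookup on the tokens, then form the dfm. Word order is still
# available at this point, so "Great Britain" can be matched as a sequence.
counts_toks <- dfm(tokens_lookup(toks, dictionary = BritainEN))

# Compare the per-document counts for the "identity" key
comparison <- data.frame(
  dfm_lookup    = as.matrix(counts_dfm)[, "identity"],
  tokens_lookup = as.matrix(counts_toks)[, "identity"]
)
comparison
```

If the two columns differ for your own corpus, that difference is the number of multi-word matches that a dfm-level lookup would have missed.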
Added
To answer the additional question in the comment, and to extend @phiver's very useful answer to it, there is also a nested_scope argument designed for matches that might occur within the value of another multi-word expression (MWE) dictionary key.
Example:
library("quanteda")
## Package version: 1.5.2
Ireland_nested <- dictionary(list(
ie_alone = "Ireland",
ie_nested = "Northern Ireland"
))
txt <- c(
doc1 = "Northern Ireland is a country.",
doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "IE_ALONE" "."
tokens_lookup(toks,
dictionary = Ireland_nested, nested_scope = "dictionary",
exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "."
The first call matches both keys, because the default nesting scope is within a key, whereas here the nested pattern occurs across two different keys. (In @phiver's example the patterns were nested within one key; in my example they are not.) When nested_scope = "dictionary", the lookup checks for nested pattern matches across the entire dictionary, not just within each key, so the "Ireland" inside "Northern Ireland" is not matched a second time in my example.
Which you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added additional options like this for those with specific needs. (And usually those needs are first expressed by Kohei or me working on a specific use case of our own!)
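Since the original question was about getting counts in a dfm, here is a short sketch of how the two nested_scope settings carry through to the final matrix. It reuses the dictionary and texts from the example above; dfm_key and dfm_dict are just illustrative names, and the default exclusive = TRUE is assumed so that only the dictionary keys remain as features.

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))
txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)

# Scope "key" (the default): nesting is only resolved within a key, so the
# "Ireland" inside "Northern Ireland" is also counted under ie_alone
dfm_key <- dfm(tokens_lookup(toks, dictionary = Ireland_nested,
                             nested_scope = "key"))

# Scope "dictionary": nesting is resolved across all keys, so the longer
# match takes precedence and the nested "Ireland" is not counted again
dfm_dict <- dfm(tokens_lookup(toks, dictionary = Ireland_nested,
                              nested_scope = "dictionary"))

as.matrix(dfm_key)
as.matrix(dfm_dict)
```

Comparing the ie_alone column between the two matrices shows how many of the "Ireland" counts came from inside "Northern Ireland".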
Upvotes: 4
Reputation: 23598
To answer your question in the comment:
How does this work if the dictionary contains a word which then also appears in a multi-word expression in the dictionary
If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it will be counted only once, but ONLY IF both values are under the same dictionary key, as in the British example in Ken's answer.
See examples below for the differences.
Example combined dictionary:
library("quanteda")
Ireland_combined <-
dictionary(list(identity = c("Ireland", "Northern Ireland")))
txt <- c(doc1 = "Northern Ireland is a country.",
doc2 = "Some citizens of Ireland live in Northern Ireland.")
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_combined , exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY" "live" "in" "IDENTITY" "."
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_combined ) %>%
dfm()
# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
# features
# docs identity
# doc1 1
# doc2 2
Example separate dictionary entries:
Ireland_seperated <-
dictionary(list(identity1 = c("Ireland"),
identity2 = "Northern Ireland"))
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_seperated , exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY1" "live" "in" "IDENTITY2" "IDENTITY1" "."
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_seperated ) %>%
dfm()
# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
# features
# docs identity1 identity2
# doc1 1 1
# doc2 2 1
Upvotes: 4