Reputation: 115
I am using tokens_lookup to check whether some texts contain the words in my dictionary, discarding matches that are nested inside a longer pattern of words with nested_scope = "dictionary", as described in this answer. The idea is to discard longer dictionary matches that contain a nested target word (e.g. include Ireland but not Northern Ireland).
Now I'd like to:
(1) create a dummy variable indicating whether the text contains the words in the dictionary. I managed to do it with the code below, but I don't understand why I have to write IE as lowercase in as.logical.
df <- structure(list(num = c(2345, 3564, 3636), text = c("Ireland lorem ipsum", "Lorem ipsum Northern
Ireland", "Ireland lorem ipsum Northern Ireland")), row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame"))
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"),
tolower = F)
corpus <- corpus(df, text_field = "text")
toks <- tokens(corpus)
dfm <- tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary", case_insensitive = F) %>%
tokens_remove("Northern Ireland") %>%
dfm()
df$contains <- as.logical(dfm[, "ie"], case_insensitive = FALSE)
(2) Store the matches in another column by using kwic. Is there a way to exclude a dictionary key in kwic (Northern Ireland in the example)? In my attempt I get a keyword column that contains both Ireland and Northern Ireland matches. (I don't know if it makes any difference, but in my full dataset I have multiple matches per row.) Thank you.
words <- kwic(toks, pattern = dict, case_insensitive = FALSE)
df$docname = dfm@Dimnames[["docs"]]
df_keywords <- merge(df, words[ , c("keyword")], by = 'docname', all.x = T)
df_keywords <- df_keywords %>% group_by(docname, num) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(docname, num, text, contains),
values_from = keyword, names_from = n, names_prefix = 'keyword')
Upvotes: 2
Views: 49
Reputation: 14902
You could do it this way:
df <- structure(list(
num = c(2345, 3564, 3636),
text = c("Ireland lorem ipsum", "Lorem ipsum Northern
Ireland", "Ireland lorem ipsum Northern Ireland")
),
row.names = c(NA, -3L),
class = c("tbl_df", "tbl", "data.frame")
)
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"),
tolower = FALSE
)
corpus <- corpus(df, text_field = "text", docid_field = "num")
toks <- tokens(corpus)
Here you need to set tolower = FALSE in the dfm() call. By default dfm() lowercases the features (tolower = TRUE), so it will lowercase the keys returned by tokens_lookup(), which is why you had to index with "ie".
dfmat <- tokens_lookup(toks, dict, nested_scope = "dictionary", case_insensitive = FALSE) %>%
dfm(tolower = FALSE)
dfmat
## Document-feature matrix of: 3 documents, 2 features (33.33% sparse) and 0 docvars.
## features
## docs IE Northern Ireland
## 2345 1 0
## 3564 0 1
## 3636 1 1
df$contains_Ireland <- as.logical(dfmat[, "IE"])
df
## # A tibble: 3 × 3
## num text contains_Ireland
## <dbl> <chr> <lgl>
## 1 2345 "Ireland lorem ipsum" TRUE
## 2 3564 "Lorem ipsum Northern\nIreland" FALSE
## 3 3636 "Ireland lorem ipsum Northern Ireland" TRUE
For part 2, we don't have the match nesting implemented for kwic(). But you can search for "Ireland" and then exclude the matches where "Northern" comes immediately before it.
words <- kwic(toks, pattern = "Ireland", case_insensitive = FALSE, window = 2) %>%
as.data.frame() %>%
# removes the matches on IE value "Ireland" nested within "Northern Ireland"
dplyr::filter(!stringr::str_detect(pre, "Northern$")) %>%
dplyr::mutate(num = as.numeric(docname))
words
## docname from to pre keyword post pattern num
## 1 2345 1 1 Ireland lorem ipsum Ireland 2345
## 2 3636 1 1 Ireland lorem ipsum Ireland 3636
dplyr::full_join(df, words, by = "num")
## # A tibble: 3 × 10
## num text contains_Ireland docname from to pre keyword post pattern
## <dbl> <chr> <lgl> <chr> <int> <int> <chr> <chr> <chr> <fct>
## 1 2345 "Irela… TRUE 2345 1 1 "" Ireland lore… Ireland
## 2 3564 "Lorem… FALSE <NA> NA NA <NA> <NA> <NA> <NA>
## 3 3636 "Irela… TRUE 3636 1 1 "" Ireland lore… Ireland
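Since you mention multiple matches per row in your full dataset, note that the join above returns one row per match rather than one row per document. A minimal base R sketch, assuming you want a single row per document with its matches pasted together, could look like this. The words data frame here is a hypothetical stand-in for the filtered kwic() output (with an invented second match for document 3636), not actual results:

```r
# Hypothetical stand-in for the filtered kwic() output:
# document 3636 is given two matches purely for illustration
words <- data.frame(
  num = c(2345, 3636, 3636),
  keyword = c("Ireland", "Ireland", "Ireland"),
  stringsAsFactors = FALSE
)
df <- data.frame(num = c(2345, 3564, 3636), stringsAsFactors = FALSE)

# One row per document: paste that document's matched keywords together
collapsed <- aggregate(keyword ~ num, data = words, FUN = paste, collapse = "; ")

# Left join so documents without any match are kept, with NA as keyword
result <- merge(df, collapsed, by = "num", all.x = TRUE)
result
```

If you would rather have one column per match, as in your pivot_wider() attempt, you can apply the same reshaping to the filtered words instead of collapsing them.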
Upvotes: 1