Reputation: 315
I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using quanteda and have written several R functions to run on each group of articles. My workflow is shown in the code below.
This seems to work okay, but I wondered whether anybody more skilled in R could check that what I am doing is correct, or suggest a more efficient way of doing it?
Thanks for any help!
# Function 1: produce the FCM
get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop)
  fcm(toks, context = "window", window = 1, tri = FALSE)
}
fcm_14q4 <- get_fcm(data_14q4)
fcm_15q1 <- get_fcm(data_15q1)
# Function 2: select the column for the term of interest (such as China 中国)
# and return a data.frame sorted by co-occurrence frequency
convert2df <- function(fcmat, term) {
  mat_term <- fcmat[, term]
  df <- convert(mat_term, to = "data.frame")
  colnames(df)[1] <- "Term"
  colnames(df)[2] <- "Freq"
  df[order(-df$Freq), ]
}
CH14q4 <- convert2df(fcm_14q4, "中国")
CH15q1 <- convert2df(fcm_15q1, "中国")
# Merging the data.frames
df <- merge(CH14q4, CH15q1, by = "Term", all = TRUE)
df <- merge(df, CH15q2, by = "Term", all = TRUE)  # etc. for all the data.frames...
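If there are many quarters, the repeated merge() calls can be collapsed with Reduce(). A minimal sketch in base R, assuming the quarterly data.frames are collected in a list (the list below is illustrative):
quarter_dfs <- list(CH14q4, CH15q1, CH15q2)  # hypothetical list of quarterly results
df <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), quarter_dfs)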
UPDATE: Following Ken's advice in the comments below, I have tried doing it a different way, using the window argument of tokens_select() and then a document-feature matrix. After labelling the corpus documents according to their quarter, the following R function should take the tokenized corpus toks and produce a data.frame of the number of times words co-occur within a specified window of a term.
COOCdfm <- function(toks, term, window) {
  ch_stop <- stopwords("zh", source = "misc")
  # Keep only the tokens within the window around the term, then clean up
  dfmat <- toks %>%
    tokens_select(term, window = window) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(ch_stop) %>%
    dfm()
  # Sum counts by the "quarter" docvar (passing the vector works in quanteda 2.x and 3.x)
  dfmat_grouped <- dfm_group(dfmat, groups = docvars(dfmat, "quarter"))
  counts <- convert(t(dfmat_grouped), to = "data.frame")
  colnames(counts)[1] <- "Feature"
  counts
}
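For reference, here is a hypothetical call. It assumes the corpus has a Date docvar named "date" from which the "quarter" label can be derived; those names are assumptions, not part of the original code:
# Label each document with its quarter (e.g. "14Q4"), tokenize, then count
docvars(corp, "quarter") <- paste0(format(docvars(corp, "date"), "%y"),
                                   quarters(docvars(corp, "date")))
toks <- tokens(corp)
counts <- COOCdfm(toks, "中国", window = 1)
head(counts)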
Upvotes: 2
Views: 142
Reputation: 14902
If you are interested in counting co-occurrences within a window for specific target terms, a better way is to use the window argument of tokens_select(), and then to count occurrences from a dfm on the window-selected tokens.
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural)
dfmat <- toks %>%
  tokens_select("nuclear", window = 5) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
topfeatures(dfmat)[-1]
##     weapons      threat        work       earth elimination         day
##           6           3           2           2           2           1
##         one        free       world
##           1           1           1
Here I've first done a "conservative" tokenisation to keep everything, then performed the context selection. I then processed that further to remove punctuation and stopwords before tabulating the results in a dfm. This will be large and very sparse, but you can summarise the top co-occurring words using topfeatures() or quanteda.textstats::textstat_frequency().
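For example, to get a fuller frequency table than topfeatures() prints (a minimal sketch, assuming the quanteda.textstats package is installed):
library("quanteda.textstats")
# Top 10 features by frequency; the target term "nuclear" itself will rank first
textstat_frequency(dfmat, n = 10)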
Upvotes: 1