Reputation: 11
I'm a beginner using R and quanteda and I can't solve the following issue, even after having read similar threads.
I have a dataset imported from Stata where the column "text" contains tweets from different groups of people identified by the variable "group". I want to count occurrences of the words in my dictionary at the group level.
Here is a reproducible example:
dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1",
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
First I create my dictionary:
climatechange_dict <- dictionary(list(
  climate = c(
    "environment*",
    "climate change")))
Then I specify the corpus:
climate_corpus <- corpus(tweets$tweet_message)
I create a dfm for each group:
group1_dfm <- dfm(corpus_subset(climate_corpus, tweets$group == "1"))
And then I try to calculate the frequency of the words in the dictionary for each group:
group1_climate <- dfm_lookup(group1_dfm, dictionary = climatechange_dict)
group1 <- subset(tweets, tweets$group == "1")
group1$climatescore <- as.numeric(group1_climate[,1])
group1$climate <- "normal"
group1$climate[group1$climatescore > 0] <- "climate"
table(group1$climate)
My problem is that, done this way, multi-word dictionary entries such as "climate change" are not counted. I have read online that I need to apply tokens_lookup() to the tokens and then construct the dfm, but I don't know how to do that in this case. I would be really grateful if you could help me with this. Many thanks!
Upvotes: 0
Views: 269
Reputation: 14902
It's hard to make sure that this will work since you don't supply a reproducible example, but try this:
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
  dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
  tokens_lookup(dictionary = climatechange_dict) %>%
  dfm(groups = "group")
This does the following:
Creates a corpus from your tweets data.frame and adds the other variables as docvars. (If you know which is a unique document identifier, you could specify that column too using docid_field = "<yourdocidentifier>".)
Does the dictionary "lookup" operation on the tokens, which means you will pick up the phrases like "climate change". This is not happening with dfm_lookup() because dfm() converts the tokens into "features" which have no record of order any more, and so cannot recover phrases.
Consolidates the documents into groups according to the group column of tweets. This obviates the need for any manual grouping using subsets. (I think this is what you wanted, right?)
The resulting dfm will be ngroups x 1, where 1 is the single key for your dictionary. You can easily coerce this to a data.frame or other format using convert().
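For anyone on a newer quanteda release, here is a minimal sketch of the same steps written without pipes. As far as I know, version 3 and later no longer accept a groups argument in dfm(), so the grouping is assumed to go through dfm_group() instead. The tweets data frame below is a cut-down stand-in for the example in the question, and the object names toks, toks_dict, grouped_dfm and climate_counts are only illustrative.

```r
library(quanteda)

# Stand-in for the question's data: only the two columns used here
tweets <- data.frame(
  tweet_message = c("Climate change is a big issue",
                    "We should care about the environment",
                    "Let's rethink environmental policies",
                    "#Davos WEF"),
  group = c("1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# Remaining columns (here: group) become docvars of the corpus
climate_corpus <- corpus(tweets, text_field = "tweet_message")

climatechange_dict <-
  dictionary(list(climate = c("environment*", "climate change")))

# Look up the dictionary on tokens, where word order is still available,
# so the two-word entry "climate change" can be matched
toks <- tokens(climate_corpus)
toks_dict <- tokens_lookup(toks, dictionary = climatechange_dict)

# Build the dfm, then sum the counts within each group
# (assumption: dfm_group() replaces the older dfm(groups = ...) call)
dict_dfm <- dfm(toks_dict)
grouped_dfm <- dfm_group(dict_dfm, groups = docvars(dict_dfm, "group"))

# One row per group, one column per dictionary key
climate_counts <- convert(grouped_dfm, to = "data.frame")
print(climate_counts)
```

Passing the grouping values explicitly via docvars() avoids depending on how a particular version interprets a quoted or unquoted variable name in groups.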
Upvotes: 1