Reputation: 111
I have a dataset with multiple texts per user. I want to build a corpus of all those documents with quanteda, but without losing the ability to link each text back to its user.
Here is some sample code that shows where I am stuck.
df <- data.frame(ID = c(1, 1, 2), Text = c("I ate apple", "I don't like fruits", "I swim in the dark"), stringsAsFactors = FALSE)
df_corpus <- corpus(df$Text, docnames = df$ID)
corpus_DFM <- dfm(df_corpus, tolower = TRUE, stem = FALSE)
print(corpus_DFM)
This results in
Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
     features
docs  i ate apple don't like fruits swim in the dark
  1   1   1     1     0    0      0    0  0   0    0
  1.1 1   0     0     1    1      1    0  0   0    0
  2   1   0     0     0    0      0    1  1   1    1
But I would like my document-feature matrix to look like this, with the user id kept alongside each document:
Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
       features
docs    id i ate apple don't like fruits swim in the dark
  text1  1 1   1     1     0    0      0    0  0   0    0
  text2  1 1   0     0     1    1      1    0  0   0    0
  text3  2 1   0     0     0    0      0    1  1   1    1
Is there a way to automate this with quanteda? I would like to modify the docs column of the dfm object, but I do not know how to access it.
Any help would be welcome!
Thank you.
Upvotes: 1
Views: 347
Reputation: 14902
The issue here is that you are specifying the docnames as "ID", but document names have to be unique. This is why the corpus constructor function assigns 1, 1.1, 2 to your docnames based on the non-unique ID.
Solution? Let corpus() assign the docnames, and keep ID as a docvar (document variable). The easiest way to do this is to pass the data.frame itself to corpus(), which calls the data.frame method rather than the character method for corpus(). (See ?corpus.)
Change your code to be:
> df_corpus <- corpus(df, text_field = "Text")
> corpus_DFM <- dfm(df_corpus, tolower = TRUE, stem = FALSE)
> print(corpus_DFM)
Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
       features
docs    i ate apple don't like fruits swim in the dark
  text1 1   1     1     0    0      0    0  0   0    0
  text2 1   0     0     1    1      1    0  0   0    0
  text3 1   0     0     0    0      0    1  1   1    1
> docvars(corpus_DFM, "ID")
[1] 1 1 2
This enables you to easily recombine your dfm by user, if you want:
> dfm_group(corpus_DFM, groups = "ID")
Document-feature matrix of: 2 documents, 10 features (45.0% sparse).
2 x 10 sparse Matrix of class "dfm"
    features
docs i ate apple don't like fruits swim in the dark
  1  2   1     1     1    1      1    0  0   0    0
  2  1   0     0     0    0      0    1  1   1    1
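If what you want in the end is a plain data frame with the user id next to the counts (as in the layout shown in the question), one possible sketch is to convert the dfm and bind the ID docvar in front. This assumes a reasonably recent quanteda, where dfm() is given a tokens object rather than a corpus; the names counts and out are just illustrative, and the name of the document column that convert() produces can differ between versions.

```r
library(quanteda)

df <- data.frame(ID = c(1, 1, 2),
                 Text = c("I ate apple", "I don't like fruits", "I swim in the dark"),
                 stringsAsFactors = FALSE)

# Build the corpus from the data.frame so that ID is kept as a docvar
df_corpus <- corpus(df, text_field = "Text")

# Tokenise first (assumption: a recent quanteda, where dfm() expects tokens)
corpus_DFM <- dfm(tokens(df_corpus), tolower = TRUE)

# Convert the dfm to a data.frame; the first column holds the document names
counts <- convert(corpus_DFM, to = "data.frame")

# Put the user id in front of the per-document feature counts
out <- cbind(id = docvars(corpus_DFM, "ID"), counts)
print(out)
```

Note that the id lives in the data frame here, not in the dfm itself: a dfm only stores feature counts, so document-level information such as ID stays in the docvars.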
Upvotes: 2