Building a corpus in Quanteda while keeping track of the ID

Question

I have a dataset with multiple texts per user. I want to build a corpus of all those documents with Quanteda without losing the ability to link each text back to its user.

Here is some sample code that shows where I am stuck.

df <- data.frame('ID'=c(1,1,2), 'Text'=c('I ate apple', "I don't like fruits", "I swim in the dark"), stringsAsFactors = FALSE)
df_corpus <- corpus(df$Text, docnames =df$ID)
corpus_DFM <- dfm(df_corpus, tolower = TRUE, stem = FALSE)
print(corpus_DFM)

This results in

Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
     features
docs  i ate apple don't like fruits swim in the dark
  1   1   1     1     0    0      0    0  0   0    0
  1.1 1   0     0     1    1      1    0  0   0    0
  2   1   0     0     0    0      0    1  1   1    1
>

But I would like my document-feature matrix to look like this, with unique document names and the user ID kept alongside each document:


Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
       features
docs    id i ate apple don't like fruits swim in the dark
  text1  1 1   1     1     0    0      0    0  0   0    0
  text2  1 1   0     0     1    1      1    0  0   0    0
  text3  2 1   0     0     0    0      0    1  1   1    1
>

Is there a way to automate this process using Quanteda? I would like to modify the docs column of the dfm object, but I do not know how to access it.

Any help would be welcome!

Thank you.
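
For reference, here is a minimal sketch of one possible way to keep the ID attached to each document. It assumes a recent quanteda release (version 3 or later), where a corpus can be built directly from a data frame and a dfm is built from tokens. The idea is to store the ID as a document variable instead of using it as the document name, so that the document names stay unique. The object names `corpus_dfm` and `dfm_df` are only illustrative.

```r
library(quanteda)

df <- data.frame(ID = c(1, 1, 2),
                 Text = c("I ate apple", "I don't like fruits", "I swim in the dark"),
                 stringsAsFactors = FALSE)

# Build the corpus from the whole data frame. Columns other than the text
# field (here, ID) are kept as document variables (docvars).
df_corpus <- corpus(df, text_field = "Text")

# Tokenize first, then build the document-feature matrix.
corpus_dfm <- dfm(tokens(df_corpus), tolower = TRUE)

# The "docs" labels are the document names; the user IDs travel with the dfm.
docnames(corpus_dfm)
docvars(corpus_dfm, "ID")

# If a plain data frame with an id column is wanted, one option is to
# convert the dfm and bind the ID in front of it.
dfm_df <- cbind(id = docvars(corpus_dfm, "ID"),
                convert(corpus_dfm, to = "data.frame"))
```

Because the ID lives in the docvars, it stays aligned with the rows of the dfm and can be looked up at any point with `docvars()`, while `docnames()` reads the document names.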
