Include ID number in dfm() output

Question

I have a dataset with an ID number column and a text column, and I am running a LIWC analysis on the text data using the quanteda package. Here's an example of my data setup:

mydata<-data.frame(
  id=c(19,101,43,12),
  text=c("No wonder, then, that ever gathering volume from the mere transit ",
         "So that in many cases such a panic did he finally strike, that few ",
         "But there were still other and more vital practical influences at work",
         "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

I have been able to conduct the LIWC analysis using:

scores <- dfm(as.character(mydata$text), dictionary = liwc)

However, when I view the results with View(scores), the original ID numbers (19, 101, 43, 12) do not appear anywhere in the output. Instead, the row.names column contains generic identifiers such as "text1" and "text2".

How can I get the dfm() function to include the ID numbers in its output? Thank you!

Ken Benoit · Accepted Answer

It sounds like you would like the row names of the dfm object to be the ID numbers from your mydata$id. This will happen automatically if you declare this ID to be the docnames for the texts. The easiest way to do this is to create a quanteda corpus object from your data.frame.

The corpus() call below assigns the docnames from your id variable. Note: the "Text" column in the summary() output looks like a numeric value, but it is actually the document name of each text.

library(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
# Text Types Tokens Sentences
#   19    11     11         1
#  101    13     14         1
#   43    12     12         1
#   12    12     14         1
# 
# Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:

From there, the document name is automatically the row label in your dfm. (You can add the dictionary = argument for your LIWC application.)

myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
#      features
# docs  no wonder then that ever gathering
#   19   1      1    1    1    1         1
#   101  0      0    0    2    0         0
#   43   0      0    0    0    0         0
#   12   0      0    0    0    0         0
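
For readers on a more recent quanteda release, the same idea can be sketched as below. This assumes quanteda version 3 or later, where dfm() is built from a tokens object and the dictionary is applied with dfm_lookup() rather than through a dictionary = argument. The toyDict dictionary is a hypothetical stand-in, since the LIWC dictionary is not distributed with the package, and the IDs are converted to character explicitly on the assumption that document names should be strings.

```r
library(quanteda)

mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# Hypothetical toy dictionary, used here only in place of the LIWC dictionary
toyDict <- dictionary(list(
  negation = c("no", "not"),
  time     = c("ever", "still", "day", "finally")
))

# Use the ID column as the document names (converted to character)
myCorpus <- corpus(mydata$text, docnames = as.character(mydata$id))

# Tokenise, build the dfm, then count dictionary matches per document
myTokens <- tokens(myCorpus, remove_punct = TRUE)
scores   <- dfm_lookup(dfm(myTokens), dictionary = toyDict)

# The row labels of the result are the original IDs
docnames(scores)

# Optional: a plain data.frame whose first column holds the document names
scoresDf <- convert(scores, to = "data.frame")
```

If the scores need to be merged back onto mydata, the document-name column of scoresDf can serve as the key, keeping in mind that it is character while mydata$id is numeric.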
