Ian Murray

Reputation: 87

Create a Document Frequency Matrix in R

I am attempting to create a document frequency matrix in R.

I currently have a dataframe (df_2), which is made up of 2 columns:

  1. doc_num: identifies the document each term comes from

  2. text_token: contains one tokenized word from that document.

Current Dataframe

The df's dimensions are 79,447 * 2.

However, there are only 400 actual documents in the 79,447 rows.

I have been trying to create this dfm using the tm package.

I have tried creating a corpus (using VectorSource) and then coercing it into a dfm using the appropriately named dfm() command.

However, this indicates that "dfm() only works on character, corpus, dfm, tokens objects." I understand my data isn't currently in the correct format for the dfm command to work. My issue is that I don't know how to get from my current point to a matrix as appears below.

Example of what I would like the matrix to look like when complete:

Example Matrix

Where 2 is the number of times cat appears in doc_2.

Any help on this would be greatly appreciated.

Yours sincerely.

Upvotes: 0

Views: 390

Answers (1)

aiatay7n

Reputation: 182

It would help you and others if all pertinent details were included with your code, such as the fact that dfm() comes from the quanteda package rather than tm. If the underlying text is set up correctly, dfm() will give you directly what you are looking for; that is precisely what it is designed for. Here is a simulation:

library(quanteda)
# install.packages("readtext")
library(readtext)

doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"

# Save each document as a plain-text file in a "docs" folder under your working directory
getwd()
dir.create("docs", showWarnings = FALSE)
writeLines(doc1, file.path("docs", "doc1.txt"))
writeLines(doc2, file.path("docs", "doc2.txt"))
writeLines(doc3, file.path("docs", "doc3.txt"))
writeLines(doc4, file.path("docs", "doc4.txt"))

# Read every file in the folder; readtext() returns one row per file
txt <- readtext(file.path("docs", "*"))
txt

corp <- corpus(txt)
# Tokenize first: recent quanteda versions expect a tokens object as input to dfm()
x <- dfm(tokens(corp))
x
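If your data is already in the long, one-token-per-row layout described in the question, something along these lines may get you most of the way there. It is only a sketch: the toy df_2 below is made up for illustration and simply reuses the column names doc_num and text_token from the question. The base R part cross-tabulates documents against tokens with table(); the optional quanteda part assumes the package is installed and rebuilds a tokens object from the per-document token vectors before calling dfm().

```r
# Toy stand-in for df_2: one row per token (values invented for illustration)
df_2 <- data.frame(
  doc_num    = c("doc_1", "doc_1", "doc_2", "doc_2", "doc_2"),
  text_token = c("cat", "dog", "cat", "cat", "fish"),
  stringsAsFactors = FALSE
)

# Base R: count each (document, token) pair.
# Rows are documents, columns are terms; unclass() leaves a plain integer matrix.
counts <- unclass(table(doc = df_2$doc_num, term = df_2$text_token))
counts
counts["doc_2", "cat"]  # number of times "cat" appears in doc_2

# Optional quanteda route (assumes quanteda is installed):
# split() gives a named list with one character vector of tokens per document,
# which as.tokens() turns into a tokens object that dfm() accepts.
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::as.tokens(split(df_2$text_token, df_2$doc_num))
  x <- quanteda::dfm(toks)
  print(x)
}
```

With the real data, the 79,447 rows should collapse to one row per document, so the result would be expected to have 400 rows. Note that dfm() lowercases tokens by default, so its columns can differ from the table() result if the tokens contain capitals.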

If the issue is instead one of formatting or cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details about your data.

Upvotes: 0
