nlplearner

Reputation: 115

How to create a document term incidence matrix from long format text data?

I've got data that look like this:

ID word
1 blue
1 red
1 green
1 yellow
2 blue
2 purple
2 orange
2 green

But I want to transform them into a binary incidence matrix denoting whether or not a word appears within a certain document ID. In other words, I'd like to create a matrix that looks like this:

ID blue red green yellow purple orange
1 1 1 1 1 0 0
2 1 0 1 0 1 1

Is there a way to do this with the tm package? I thought DocumentTermMatrix() might work, since I don't think any word in my corpus occurs more than once within a single document, but everything I've tried has returned error messages saying the function is incompatible with objects of class data.frame.

Upvotes: 0

Views: 304

Answers (2)

JBGruber

Reputation: 12420

If you want to do this in order to run a supervised or unsupervised machine learning model, you should cast the tidy data frame directly into a document-feature matrix (dfm). A dfm is a class of sparse matrix that can be used efficiently for these tasks. You can use cast_dfm from tidytext for this (it needs the quanteda package to be installed), but you have to count the occurrences of each word first.

library(tidyverse)
library(tidytext)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue","red", "green","yellow","blue","purple","orange","green")
)

df %>% 
  count(ID, word) %>% 
  cast_dfm(ID, word, n)
#> Document-feature matrix of: 2 documents, 6 features (33.33% sparse) and 0 docvars.
#>     features
#> docs blue green red yellow orange purple
#>    1    1     1   1      1      0      0
#>    2    1     1   0      0      1      1

Created on 2022-02-12 by the reprex package (v2.0.1)

You can convert this object back into a data frame with quanteda::convert(x, to = "data.frame") but it would make more sense to use it directly if you run a classification task.
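Since the question asks about tm specifically: tidytext also provides cast_dtm(), which returns a tm DocumentTermMatrix from the same counted data frame. The following is only a minimal sketch, assuming the tm package is installed; the names dtm and incidence are made up for illustration. tm::weightBin() is applied so that every non-zero count becomes 1, which keeps the matrix binary even if a word were repeated within a document.

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
)

# Count each word per document, then cast to a tm DocumentTermMatrix
dtm <- df %>%
  count(ID, word) %>%
  cast_dtm(ID, word, n)

# Binary weighting: every non-zero count becomes 1
dtm <- tm::weightBin(dtm)

# Dense 0/1 matrix with documents in rows and words in columns
# (fine for small data; keep the sparse object for large corpora)
incidence <- as.matrix(dtm)
```

The row names of incidence are the document IDs as character strings, so a single cell can be looked up with, for example, incidence["2", "red"].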

Upvotes: 2

PaulS

Reputation: 25333

A possible solution, based on tidyr::pivot_wider:

library(tidyverse)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue","red", "green","yellow","blue","purple","orange","green")
)

df %>%
  # Name id_cols explicitly; recent tidyr versions deprecate passing it by position
  pivot_wider(id_cols = ID, names_from = word, values_from = word,
              values_fn = length, values_fill = 0)

#> # A tibble: 2 × 7
#>      ID  blue   red green yellow purple orange
#>   <int> <int> <int> <int>  <int>  <int>  <int>
#> 1     1     1     1     1      1      0      0
#> 2     2     1     0     1      0      1      1
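Note that values_fn = length counts rows, so the result is only 0/1 as long as no word is repeated within an ID. If that cannot be guaranteed, one option is to drop duplicates first and fill in an explicit indicator. This is just a sketch of that variant: the extra "green" row for ID 2 and the column name present are added here purely for illustration and are not part of the original data.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  stringsAsFactors = FALSE,
  # Same data as above, plus one repeated "green" for ID 2 (added for illustration)
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green", "green")
)

incidence <- df %>%
  distinct(ID, word) %>%       # keep each ID/word pair once
  mutate(present = 1L) %>%     # hypothetical indicator column
  pivot_wider(id_cols = ID, names_from = word,
              values_from = present, values_fill = 0L)
```

With the duplicates removed up front, every cell is either 0 or 1 regardless of how often a word occurs in a document.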

Upvotes: 1
