Reputation: 115
I've got data that look like this:
ID | word |
---|---|
1 | blue |
1 | red |
1 | green |
1 | yellow |
2 | blue |
2 | purple |
2 | orange |
2 | green |
But I want to transform them into a binary incidence matrix denoting whether or not a word appears within a certain document ID. In other words, I'd like to create a matrix that looks like this:
ID | blue | red | green | yellow | purple | orange |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 |
2 | 1 | 0 | 1 | 0 | 1 | 1 |
Is there a way to do this with the tm package? I thought DocumentTermMatrix() might work, since I don't think any word in my corpus occurs more than once within a single document, but everything I've tried has returned error messages about the function being incompatible with objects of class data.frame.
Upvotes: 0
Views: 304
Reputation: 12420
If you want to do this to run a supervised or unsupervised machine learning model, you should cast the tidy data frame directly into a document-feature matrix (dfm). A dfm is a class of sparse matrix that can be used effectively for these tasks. You can use cast_dfm from tidytext for this, but you have to count the occurrences of each word first.
library(tidyverse)
library(tidytext)
df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
)
df %>%
  count(ID, word) %>%
  cast_dfm(ID, word, n)
#> Document-feature matrix of: 2 documents, 6 features (33.33% sparse) and 0 docvars.
#>     features
#> docs blue green red yellow orange purple
#>    1    1     1   1      1      0      0
#>    2    1     1   0      0      1      1
Created on 2022-02-12 by the reprex package (v2.0.1)
You can convert this object back into a data frame with quanteda::convert(x, to = "data.frame"), but it would make more sense to use it directly if you run a classification task.
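As a minimal sketch of that conversion, assuming the quanteda package is installed: the variable names dfm_counts, dfm_binary and incidence are made up for illustration, and the dfm_weight(scheme = "boolean") step is an optional extra that recodes any positive count to 1, which only matters if a word can appear more than once in a document.

```r
library(dplyr)
library(tidytext)
library(quanteda)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
)

# Count each word per document and cast the counts into a dfm
dfm_counts <- df %>%
  count(ID, word) %>%
  cast_dfm(ID, word, n)

# Optional: recode every non-zero count to 1 so the matrix is strictly binary
dfm_binary <- dfm_weight(dfm_counts, scheme = "boolean")

# Convert the dfm to an ordinary data frame (one row per document)
incidence <- convert(dfm_binary, to = "data.frame")
incidence
```

With the example data every count is already 0 or 1, so the boolean weighting changes nothing here; it is only a safeguard for corpora with repeated words.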
Upvotes: 2
Reputation: 25333
A possible solution, based on tidyr::pivot_wider:
library(tidyverse)
df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
)
# id_cols is named explicitly; recent tidyr versions deprecate passing it by position
df %>%
  pivot_wider(id_cols = ID, names_from = word, values_from = word,
              values_fn = length, values_fill = 0)
#> # A tibble: 2 × 7
#>      ID  blue   red green yellow purple orange
#>   <int> <int> <int> <int>  <int>  <int>  <int>
#> 1     1     1     1     1      1      0      0
#> 2     2     1     0     1      0      1      1
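Note that values_fn = length returns counts, so the result is only binary when no word repeats within a document. If repeats are possible, one variation (a sketch, not part of the answer above) is to drop duplicates first and pivot a constant indicator column. The extra "blue" row and the names df_dup, present and incidence are hypothetical, added only to illustrate the duplicate case.

```r
library(dplyr)
library(tidyr)

# Same data as above, plus one hypothetical repeated word ("blue" twice in document 1)
df_dup <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "blue", "red", "green", "yellow",
           "blue", "purple", "orange", "green")
)

incidence <- df_dup %>%
  distinct(ID, word) %>%          # keep one row per (document, word) pair
  mutate(present = 1L) %>%        # indicator column to spread across words
  pivot_wider(id_cols = ID, names_from = word,
              values_from = present, values_fill = 0L)

incidence
```

Because each (ID, word) pair occurs exactly once after distinct(), no values_fn is needed and every cell is either 0 or 1.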
Upvotes: 1