Changing the output of the text_tokens() function in R

Question

I have a question regarding text mining with the corpus package and its function text_tokens(). I want to use the function for stemming and removing stop words, and I need to apply it to a large amount of data (almost 1,000,000 comments). However, I'm having trouble with the output that text_tokens() produces. Here is a basic example of my data and code:

library(tidyverse)
library(corpus)
library(stopwords)

text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))


tmp <- text_tokens(text$comment_content, 
                   text_filter(stemmer = "de",drop = stopwords("german")))

My problem is that I want a data.frame as output, with comment_id in the first column and the word tokens in the second column. The output I would like to have looks as follows:

df <- data.frame(comment_id = c(1,1,1,2,2,2),
                 comment_tokens = c("hallo","nam","aaron","lieb","dank","video"))

I tried different do.call() variants (cbind/rbind), but they don't give me the result I need. Which function am I looking for? Is it map() from the tidyverse?

Thank you in advance.

Cheers,

Aaron

Matt · Accepted Answer

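Here's an option using imap_dfr from purrr. text_tokens() returns a list with one character vector per comment, and imap_dfr passes each element together with its position in that list. The position matches comment_id here because the IDs are 1:2: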

library(corpus)
library(dplyr)
library(purrr)

text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))


tmp <- text_tokens(text$comment_content, 
                   text_filter(stemmer = "de",drop = corpus::stopwords_de)) %>% 
  purrr::imap_dfr(function(x, y) {
    tibble(
      comment_id = y,
      comment_tokens = x
    )
  })

tmp
#> # A tibble: 6 × 2
#>   comment_id comment_tokens
#>        <int> <chr>
#> 1          1 hallo         
#> 2          1 nam           
#> 3          1 aaron         
#> 4          2 lieb          
#> 5          2 dank          
#> 6          2 video

Or, if you prefer purrr's formula shorthand for the anonymous function:

tmp <- text_tokens(text$comment_content, text_filter(stemmer = "de",drop = corpus::stopwords_de)) %>% 
  purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
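With close to 1,000,000 comments, building one tibble per comment may become slow, and the position index only equals comment_id when the IDs run 1, 2, 3, and so on. A minimal base R sketch of an alternative is below. It assumes the token list has one element per comment, in the same order as the IDs. The helper name tokens_to_long is hypothetical (it is not part of corpus), and the token list is a hand-written stand-in for the text_tokens() result shown above, with made-up IDs to show that they need not be consecutive.

```r
# Hypothetical helper (not part of corpus): flatten a list of token vectors
# into one row per token, repeating each id once per token of its comment.
tokens_to_long <- function(ids, tokens) {
  # Assumption: one list element per id, in matching order
  stopifnot(length(ids) == length(tokens))
  data.frame(
    comment_id = rep(ids, times = lengths(tokens)),
    comment_tokens = unlist(tokens, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Hand-written stand-in for the list returned by text_tokens() above
tokens <- list(c("hallo", "nam", "aaron"), c("lieb", "dank", "video"))

# Illustrative, non-consecutive ids
df <- tokens_to_long(ids = c(101L, 205L), tokens = tokens)
```

With the real data, the call would presumably be tokens_to_long(text$comment_id, text_tokens(text$comment_content, text_filter(stemmer = "de", drop = corpus::stopwords_de))).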
