Reputation: 1875
I am reading in the text from a number of PDFs in a directory.
Then I split these texts into single words (tokens) using the tidytext::unnest_tokens() function.
Can someone please tell me how I can add an additional column to the test tibble with the name of the file each word comes from?
library(pdftools)
library(tidyverse)
library(tidytext)
files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text)
list <- unlist(content, recursive = TRUE, use.names = TRUE)
df = data.frame(text = list)
test <- df %>% tidytext::unnest_tokens(word, text)
Upvotes: 1
Views: 456
Reputation: 4344
The plyr package has a nice function for binding a list into a data frame and using the list names as a new column:
library(pdftools)
library(plyr)
library(tidyverse)
library(tidytext)
files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text)
# set the list names according to the file names
names(content) <- files
list <- unlist(content, recursive = TRUE, use.names = TRUE)
# use ldply() from the plyr package and check the result
plyr::ldply(list)
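One thing to watch for with this approach: unlist() appends a running index to the names of elements that come from a multi-page PDF, so the names are no longer the bare file names. Below is a minimal base R sketch of that behaviour and of one possible way around it. The file names a.pdf and b.pdf and the page texts are made up and stand in for the pdf_text() output.

```r
# Hypothetical stand-in for lapply(files, pdf_text): one character
# vector per file, one element per page.
content <- list(
  "a.pdf" = c("first page of a", "second page of a"),
  "b.pdf" = "only page of b"
)

# unlist() makes the names unique by appending a page index
# to files with more than one page.
flat <- unlist(content, recursive = TRUE, use.names = TRUE)
print(names(flat))

# Alternative: repeat each file name once per page, which keeps
# the file names unmodified.
pages <- data.frame(
  filename = rep(names(content), lengths(content)),
  text = unlist(content, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(pages)
```

The pages data frame can then be passed to unnest_tokens(word, text), which should keep the filename column next to each word.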
Upvotes: 1
Reputation: 30474
You can try the following. Instead of using unlist with all the files, pass the entire list of files to map_df from purrr. Then you can add a column with filename along with the word column.
library(pdftools)
library(tidyverse)
library(tidytext)
files <- list.files(pattern = "pdf$")
map_df(files, ~ data.frame(txt = pdf_text(.x)) %>%
  mutate(filename = .x) %>%
  unnest_tokens(word, txt))
Upvotes: 2
Reputation: 79228
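You could do:
library(pdftools)
library(tidyverse)
library(tidytext)
files <- list.files(pattern = "pdf$")
# stack() returns a data frame with columns `values` (page text) and `ind` (file name)
content <- stack(sapply(files, pdf_text, simplify = FALSE))
content %>%
  tidytext::unnest_tokens(word, values)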
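To see what stack() produces without needing any PDFs, here is a small base R sketch. The named list is a made-up stand-in for sapply(files, pdf_text, simplify = FALSE), and the file names are hypothetical.

```r
# Hypothetical stand-in for sapply(files, pdf_text, simplify = FALSE):
# a list named by file, one element per page.
content <- list(
  "a.pdf" = c("first page of a", "second page of a"),
  "b.pdf" = "only page of b"
)

# stack() puts the page texts in `values` and the list names in `ind`.
stacked <- stack(content)
print(names(stacked))
print(stacked)

# `ind` is a factor; convert it if a plain character column is preferred.
stacked$filename <- as.character(stacked$ind)
```

Note that the text column is called values, so that is the name to pass as the input column to unnest_tokens().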
Upvotes: 1