Li4991

Reputation: 81

Upload text document in R

I am trying to load several text documents into a data frame in R. My desired output is a table with two columns:

DOCUMENT    CONTENT
Document A  This is the content.
Document B  This is the content.
Document C  This is the content.

The "CONTENT" column should hold all of the text from the corresponding document (a 10-K report).

> library(tm)
> setwd("C:/Users/folder")
> folder <- getwd()
> corpus <- Corpus(DirSource(directory = folder, pattern = "\\.txt$"))

This creates a corpus that I can tokenize, but I can't manage to convert it to a data frame or get my desired output.

Can somebody help me?

Upvotes: 0

Views: 77

Answers (1)

jrcalabrese

Reputation: 2321

If you're only working with .txt files and your end goal is a data frame, I think you can skip the corpus step and simply read all your files into a list. The tricky part is getting the names of the .txt files into a column called DOCUMENT, but this can be done in base R.

# make a reproducible example
a <- "this is a test"
b <- "this is a second test"
c <- "this is a third test"
write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")

# get working dir
folder <- getwd()

# get names/locations of all files
filelist <- list.files(path = folder, pattern = "\\.txt$", full.names = FALSE)

# read in the files and put them in a list
lst <- lapply(filelist, readLines)

# name each list element after its file, then strip the `.txt` extension
names(lst) <- filelist
namelist <- tools::file_path_sans_ext(basename(filelist))

# add a DOCUMENT column holding the original file name to every element of the list
lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)

# combine into a dataframe
x <- do.call(rbind.data.frame, lst) 

# a small amount of clean-up
rownames(x) <- NULL
names(x)[names(x) == "V1"] <- "CONTENT"
x <- x[,c(2,1)]
x
#>   DOCUMENT               CONTENT
#> 1        a        this is a test
#> 2        b this is a second test
#> 3        c  this is a third test
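
One caveat for longer files such as 10-K reports: `readLines()` returns one element per line, so the approach above yields one row per line rather than one row per document. Below is a minimal sketch of a variant that collapses each file into a single string first. The helper `read_document()` and the temporary `docs_example` folder are made up for illustration, and joining lines with "\n" is an assumption about how you want the text stored.

```r
# Hypothetical helper (not from any package): read a file and
# collapse its lines into one string, separated by newlines.
read_document <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Build a small example in a temporary folder so no existing files are touched
folder <- file.path(tempdir(), "docs_example")
dir.create(folder, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(folder, "a.txt"))
writeLines("only one line", file.path(folder, "b.txt"))

# Full paths of all .txt files in the folder
files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# One row per file: file name without extension, plus the full text
docs <- data.frame(
  DOCUMENT = tools::file_path_sans_ext(basename(files)),
  CONTENT  = vapply(files, read_document, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

str(docs)
```

With real data, you would point `folder` at the directory that holds your reports instead of creating the example files.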

Upvotes: 1
