how to read text files in quanteda, storing each line as a document

Question

I have texts stored in several files.
Within each file, every line is a document (the text of a blog post, the text of a tweet, etc.).
If I read the files with the readtext package in the default way shown in its documentation and examples, the content of each file becomes a single document instead of each line being a document.

My goal is to use a quanteda corpus, with each line stored as a document.
I am using readtext as it is a companion package to quanteda, but using readtext is not a strict requirement.

I would like to avoid manually splitting the original files into smaller files, each corresponding to a single line.

Kohei Watanabe · Accepted Answer

Method 1: use readLines() in combination with list.files():

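library(quanteda)

txt <- character()
# full.names = TRUE returns full paths, so readLines() works from any working directory
for (f in list.files("your-folder", full.names = TRUE)) {
    txt <- c(txt, readLines(f))
}
corp <- corpus(txt)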
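
A loop-free variant of Method 1 that also records which file and line each document came from might look like the sketch below. It is not part of the original answer: the helper name read_lines_as_docs and the "file:line" naming scheme are assumptions made for illustration, and the two temporary files exist only so the example runs on its own.

```r
library(quanteda)

# Hypothetical helper: read every matching file in `dir`, one element per line,
# and name each element "<file name>:<line number>" so documents stay traceable.
read_lines_as_docs <- function(dir, pattern = "\\.txt$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  lines_by_file <- lapply(files, readLines, warn = FALSE)
  txt <- unlist(lines_by_file, use.names = FALSE)
  if (length(txt) == 0) return(character())
  n <- lengths(lines_by_file)  # number of lines in each file
  names(txt) <- paste0(rep(basename(files), n), ":", sequence(n))
  txt
}

# Toy input (assumed data): two small files in a temporary folder.
demo_dir <- tempfile("docs_")
dir.create(demo_dir)
writeLines(c("first blog post", "second blog post"), file.path(demo_dir, "blogs.txt"))
writeLines("only tweet", file.path(demo_dir, "tweets.txt"))

txt <- read_lines_as_docs(demo_dir)
corp <- corpus(txt)  # names of the character vector become document names
ndoc(corp)
```

Because corpus() takes document names from the names of a character vector, each line can later be traced back to its source file. Blank lines are kept as empty documents here; filtering them out with txt[nzchar(txt)] before calling corpus() is one option if that is not wanted.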

Method 2: you can split lines in a corpus using corpus_segment():

library(readtext)
library(quanteda)

corp <- corpus(readtext("your-folder"))
corp_line <- corpus_segment(corp, pattern = "\n", extract_pattern = FALSE, pattern_position = "after")

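To see what Method 2 does without touching the file system, the same corpus_segment() call can be tried on a small in-memory corpus. This is only a sketch: the two texts and their names are made-up stand-ins for files, and valuetype = "fixed" is added here as an assumption so that "\n" is matched as a literal string.

```r
library(quanteda)

# Two toy "files" held in memory; the names stand in for file names (assumed data).
corp <- corpus(c(
  blogs.txt  = "first blog post\nsecond blog post",
  tweets.txt = "only tweet"
))

# Split each document after every newline, keeping the text of each line.
corp_line <- corpus_segment(corp, pattern = "\n", valuetype = "fixed",
                            extract_pattern = FALSE, pattern_position = "after")

ndoc(corp_line)      # number of line-level documents
docnames(corp_line)  # names derived from the original document names
```

Compared with Method 1, this keeps the link between each line and its source document in the document names, at the cost of first loading each file as one large text.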