Reputation: 77
I have texts stored in several files.
Within each file, every line is a document (the text of a blog post, the text of a tweet, etc.).
If I read the files with the readtext package in the default way shown in its documentation and examples, the content of each file becomes a single document instead of each line being a document.
My goal is to use a quanteda corpus, with each line stored as a document.
I am using readtext as it is a companion package to quanteda, but using readtext is not a strict requirement.
I would like to avoid manually splitting the original files into smaller files, each corresponding to one line.
Upvotes: 0
Views: 502
Reputation: 880
Method 1: use readLines() in combination with list.files():
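library(quanteda)
txt <- character()
# full.names = TRUE returns paths that readLines() can open from any working directory
for (f in list.files("your-folder", full.names = TRUE)) {
  txt <- c(txt, readLines(f))
}
# each element of txt (one line of a file) becomes one document
corp <- corpus(txt)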
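If you also want to know which file each line came from, a variation on Method 1 can record the file name as a document variable. This is only a sketch under some assumptions: the temporary folder, the two demo files and the docvar name source_file are made up for illustration and stand in for "your-folder" and its contents.

```r
library(quanteda)

# Hypothetical stand-in for "your-folder": a temp directory with two small files
folder <- file.path(tempdir(), "line-docs-demo")
dir.create(folder, showWarnings = FALSE)
writeLines(c("first post", "second post"), file.path(folder, "blogs.txt"))
writeLines("only tweet", file.path(folder, "tweets.txt"))

# Read every file into its own character vector (one element per line)
files <- list.files(folder, full.names = TRUE)
lines_by_file <- lapply(files, readLines)

# Flatten to one vector of lines and repeat each file name once per line
txt <- unlist(lines_by_file)
source_file <- rep(basename(files), lengths(lines_by_file))

# One document per line, with the originating file stored as a docvar
corp <- corpus(txt, docvars = data.frame(source_file = source_file))
```

Calling docvars(corp, "source_file") afterwards returns the file name for each document, which can be handy for subsetting the corpus by source.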
Method 2: you can split lines in a corpus using corpus_segment():
library(quanteda)
library(readtext)
corp <- corpus(readtext("your-folder"))
corp_line <- corpus_segment(corp, "\n", extract_pattern = FALSE, pattern_position = "after")
Upvotes: 1