Reputation: 13
I have a single text file that contains many speeches. The file contains two variables, one for speech_id and the other for the text of the speech, separated by a pipe (|). I’m trying to use the corpus_segment function in quanteda to break the text into smaller documents. The .txt file looks like this:
Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this.
I’ve tried various iterations but can’t get it to work. I’ve also tried the readtext function from the readtext package to read the file in, but had no luck. Any help is greatly appreciated.
Upvotes: 1
Views: 311
Reputation: 14902
corpus_segment() should work fine. (This is based on quanteda >= 1.0.0.) Here, I am assuming that all speech IDs are 10 digits followed by the | character. Note that readtext would have worked to read this .txt file, but the result would have been a single "document" of one row.
library("quanteda")
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."
corp <- corpus(txt)
corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
texts(corpseg)
## text1.1 text1.2
## "This is the first speech." "The second \nspeech starts here."
## text1.3 text1.4
## "This is the third speech." "The fourth \nspeaker says this."
That got it, but we can tidy it up a bit more by moving the pattern that was extracted to be a docname.
# move the tag to docname after removing "|"
docnames(corpseg) <-
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL
summary(corpseg)
## Corpus consisting of 4 documents:
##
## Text Types Tokens Sentences
## 1140000001 6 6 1
## 1140000002 6 6 1
## 1140000003 6 6 1
## 1140000004 6 6 1
##
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
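If the speeches sit in a file rather than in a string, one possible way to get the same single-document input is to read the file with base R and collapse the lines before calling corpus(). The sketch below is only an illustration: the file name speeches.txt is hypothetical (it is written to a temporary directory here so the example is self-contained), and the names path, tag_pattern, tags, ids, pieces and speeches are mine. It also checks the 10-digit tag pattern with base R alone, which can help confirm the regex before handing txt to corpus() and corpus_segment() as above.

```r
# Hypothetical file: write the sample from the question to a temporary
# location so that this sketch is self-contained.
path <- file.path(tempdir(), "speeches.txt")
writeLines(c(
  "Speech_id|speech1140000001|This is the first speech.1140000002|The second",
  "speech starts here.1140000003|This is the third speech.1140000004|The fourth",
  "speaker says this."
), path)

# Read every line and collapse into one string, i.e. a single "document".
txt <- paste(readLines(path), collapse = "\n")

# Same assumption as above: each speech ID is 10 digits followed by "|".
tag_pattern <- "\\d{10}\\|"

# Extract the tags and strip the trailing pipe to get the bare IDs.
tags <- regmatches(txt, gregexpr(tag_pattern, txt, perl = TRUE))[[1]]
ids <- sub("|", "", tags, fixed = TRUE)

# Split on the tags; the first piece is the header row, so drop it.
pieces <- strsplit(txt, tag_pattern, perl = TRUE)[[1]][-1]
speeches <- setNames(pieces, ids)
```

If the pattern is right, speeches should hold one element per speech, named by its ID. This base R split is not what corpus_segment() does internally; it is just a quick cross-check of the assumption about the ID format.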
Upvotes: 0