Reputation: 45
I have a data frame consisting of 500 written protocols of parliamentary debates, where every session represents a new row. My goal is to create a data frame in which each row is a unique utterance and where no parts of the strings are dropped.
The protocols have a standard format, in which each new utterance is introduced by the name and party/organisation of the speaker, followed by a colon. For example, these instances take the form of
"MP Peter Mueller (SPD):" or "External Expert Petra Meier (German Trade Union):"
The protocols may look a little like this:
protocol <- "MP Peter Mueller (SPD): What do you think about the bill? External Expert Petra Meier (German Trade Union): I support the bill. MP Peter Mueller (SPD): Thank you for your expertise."
I am familiar with all the regular expressions I need to match and locate these instances of a new utterance. For this example, I would use:
utterances <- c(grep("MP \\w+ \\w+ \\(\\w+\\):", protocol),
grep("External Expert \\w+ \\w+ \\(\\w+ \\w+ \\w+\\):", protocol))
I am now struggling to extract every new utterance as a substring and write it into a new row in my data frame. My expected output is:
dataframe
[1] "MP Peter Mueller (SPD): What do you think about the bill?"
[2] "External Expert Petra Meier (German Trade Union): I support the bill."
[3] "MP Peter Mueller (SPD): Thank you for your expertise."
Thanks a lot for your help!
Upvotes: 0
Views: 37
Reputation: 23574
I'll leave what I said in my comment here. Using the provided example (i.e., protocol), one suggestion is the following. You want to split the string at every position that follows either a question mark (?) or a period (.). Here I used stri_split_regex() from the stringi package, but any similar function would work. Note that this works for the example because each utterance consists of a single sentence.
library(stringi)
trimws(unlist(stri_split_regex(str = protocol, pattern = "(?<=[.?])", omit_empty = TRUE)))
#[1] "MP Peter Mueller (SPD): What do you think about the bill?"
#[2] "External Expert Petra Meier (German Trade Union): I support the bill."
#[3] "MP Peter Mueller (SPD): Thank you for your expertise."
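If an utterance can contain more than one sentence, splitting after every ? or . would cut it into pieces. A possible alternative is to split on the whitespace that directly precedes a speaker header, so the header stays attached to its utterance. The following is only a sketch in base R. The protocols data frame, its session and text columns, and the header pattern are assumptions made for illustration; the pattern covers only the two speaker formats shown in the question and would need extending for other titles or for names with more than two words. With perl = TRUE, \\w may also fail to match non-ASCII letters such as umlauts, depending on the locale.

```r
# Hypothetical input: one row per session, as described in the question.
protocols <- data.frame(
  session = 1:2,
  text = c(
    paste(
      "MP Peter Mueller (SPD): What do you think about the bill? Is it fair?",
      "External Expert Petra Meier (German Trade Union): I support the bill."
    ),
    "MP Peter Mueller (SPD): Thank you for your expertise."
  ),
  stringsAsFactors = FALSE
)

# Assumed header format: title, first name, last name, "(party or organisation):".
header <- "(?:MP|External Expert) \\w+ \\w+ \\([^)]+\\):"

# Split on whitespace that is immediately followed by a speaker header.
# The lookahead keeps the header itself in the next piece.
pattern <- paste0("\\s+(?=", header, ")")
pieces <- strsplit(protocols$text, pattern, perl = TRUE)

# One row per utterance, keeping track of the session it came from.
utterances <- data.frame(
  session = rep(protocols$session, lengths(pieces)),
  utterance = unlist(pieces),
  stringsAsFactors = FALSE
)

print(utterances)
```

Because the split only consumes the whitespace between utterances, no part of the original string is dropped apart from that separator.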
Upvotes: 1