Tobias Remschel

Reputation: 45

Splitting strings - isolating utterances in parliamentary debates

I have a data frame consisting of 500 written protocols of parliamentary debates, where every session represents a new row. My goal is to create a data frame in which each row is a unique utterance and where no parts of the strings are dropped.

The protocols have a standard format, in which each new utterance is introduced by the name and party/organisation of the speaker, followed by a colon. For example, these speaker labels take the form

"MP Peter Mueller (SPD):" or "External Expert Petra Meier (German Trade Union):"

The protocols may look a little like this:

protocol <- "MP Peter Mueller (SPD): What do you think about the bill? External Expert Petra Meier (German Trade Union): I support the bill. MP Peter Mueller (SPD): Thank you for your expertise."

I am familiar with all the regular expressions I need to match and locate these instances of a new utterance. For this example, I would use:

utterances <- c(grep("MP \\w+ \\w+ \\(\\w+\\):", protocol),
                grep("External Expert \\w+ \\w+ \\(\\w+ \\w+ \\w+\\):", protocol))

I am now struggling to extract every new utterance as a substring and write it into a new row in my data frame. My expected output is:

dataframe
[1] "MP Peter Mueller (SPD): What do you think about the bill?" 
[2] "External Expert Petra Meier (German Trade Union): I support the bill." 
[3] "MP Peter Mueller (SPD): Thank you for your expertise."

Thanks a lot for your help!

Upvotes: 0

Views: 37

Answers (1)

jazzurro

Reputation: 23574

Here is what I suggested in my comment. Using the provided example (i.e., protocol), one option is to split the string after every ? or . and trim the surrounding whitespace. Here I used stri_split_regex() from the stringi package, but any similar function would work.

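library(stringi)

# Split at whitespace preceded by "." or "?"; the lookbehind keeps the punctuation in the result.
trimws(unlist(stri_split_regex(str = protocol, pattern = "(?<=[.?])\\s+", omit_empty = TRUE)))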

#[1] "MP Peter Mueller (SPD): What do you think about the bill?"            
#[2] "External Expert Petra Meier (German Trade Union): I support the bill."
#[3] "MP Peter Mueller (SPD): Thank you for your expertise." 
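
One caveat: splitting after every ? or . breaks an utterance into several rows as soon as a speaker says more than one sentence. A minimal sketch of an alternative is to split only where a speaker label begins. It assumes that every label starts with "MP" or "External Expert" and ends with "(<party or organisation>):", as in the example. The names speaker_label and split_utterances are hypothetical helpers introduced here, not part of any package.

```r
protocol <- "MP Peter Mueller (SPD): What do you think about the bill? External Expert Petra Meier (German Trade Union): I support the bill. MP Peter Mueller (SPD): Thank you for your expertise."

# Assumed label format: a role, a name, then "(party or organisation):".
# Extend the alternation if the real protocols use other roles.
speaker_label <- "(?:MP|External Expert)\\s[^:()]*\\([^()]*\\):"

# Hypothetical helper: split one protocol string into its utterances.
split_utterances <- function(x) {
  # Split on whitespace that is immediately followed by a speaker label.
  # The lookahead does not consume the label, so it stays attached to
  # the utterance it introduces and no text is dropped.
  pattern <- paste0("\\s+(?=", speaker_label, ")")
  trimws(strsplit(x, pattern, perl = TRUE)[[1]])
}

utterances <- split_utterances(protocol)

# One row per utterance.
dataframe <- data.frame(utterance = utterances, stringsAsFactors = FALSE)
```

For a data frame with one protocol per row, the same helper could be applied to each row with lapply() and the results stacked. Labels that deviate from the assumed format would not be detected, so it is worth checking a few sessions by hand.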

Upvotes: 1
