James Martherus
James Martherus

Reputation: 1043

Parsing Interview Text

I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:

"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

Would become:

   name          text
1   Bob Smith    Hi Steve. How are you doing?
2 Steve Brown    Hi Bob. I'm doing well!

Question: How do I split the statements from the names? I tried splitting on the colon:

data <- strsplit(data, split=":")

But then I get this:

"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"

When what I want is this:

"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"

Upvotes: 0

Views: 141

Answers (2)

Ian Campbell
Ian Campbell

Reputation: 24838

I doubt this will fix all of your parsing needs, but an approach using strsplit to solve your most immediate question is using lookaround. You'll need to use perl regex though.

Here you instruct strsplit to split on either : or a space where there is a punctuation character immediately before and nothing but alphanumeric characters or spaces between the space and :. \\pP matches punctuation characters and \\w matches word characters.

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data,split="(: |(?<=\\pP) (?=[\\w ]+:))",perl=TRUE)
[[1]]
[1] "Bob Smith"                    "Hi Steve. How are you doing?" "Steve Brown"                 
[4] "Hi Bob. I'm doing well!"  

Upvotes: 2

caldwellst
caldwellst

Reputation: 5956

We can extract these with regex using the stringr package. You then directly have the columns of speaker and quote you are looking for.

a <- "Bob: Hi Steve. Steve: Hi Bob."

library(stringr)

str_match_all(a, "([A-Za-z]*?): (.*?\\.)")
#> [[1]]
#>      [,1]             [,2]    [,3]       
#> [1,] "Bob: Hi Steve." "Bob"   "Hi Steve."
#> [2,] "Steve: Hi Bob." "Steve" "Hi Bob."

Upvotes: 0

Related Questions