Regex in R: How to separate or omit groups of regex to parse a transcript? (Massive problem with unequal rows)

Question

I'm working on creating a custom transcript parser that only parses speakers from speeches. I've hit a roadblock in how to set this up in a for loop. The issue: I have over 50 rows with a column named "lines", this variable carries the data that I need to extract and append to the data already there. The data looks as follows:

        col_date        col_name_speech           col_lines
Row 1:  2001-06-09         test1          "SPEAKER A. W. Hello this is a speech that I am giving. Blah. 
                                            Blah. SPEAKER B. W. This 
                                            is ALSO a line of a speech that I am giving."

Row 2:    2002-09-10         test2         "This is procedural garbage. I hate garbage. Blah BLAH Blah. 
                                            header. SPEAKER. T. I. I have a speech now."

Row 3:    2006-09-19        test3          "procedural garbage. SPEAKER E. W. Wow, what a good speech. 
                                             SPEAKER. T. I. yes. SPEAKER E. W. indeed."

The data frame that I would like to produce should look like this in the end:

date           name        speaker          lines
2001-06-09     test1       SPEAKER A. W.    Hello this is a speech that I am giving. Blah. Blah.
2001-06-09     test1       SPEAKER B. W.    This is ALSO a line of a speech that I am giving
2002-09-10     test2       SPEAKER T. I.    I have a speech now.
2006-09-19     test3       SPEAKER E. W.    Wow, what a good speech. indeed.
2006-09-19     test3       SPEAKER. T. I.   yes.

You should notice that the lines variable drops procedural garbage from the header of the row and combines the speakers' speeches.

I have a decent regex pattern to detect speakers and it works when applied to single rows, but it does not work in a for loop since the ultimate data frame has different row numbers in terms of speakers and speeches (which I attribute to the header problem). I should note the difference in row numbers is almost the same amount as the number of rows. I will provide both sets of code.

Works to separate speakers one row at a time:

  person  = c(NA, unlist(stri_extract_all(data$lines[[1]], 
                                          regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)")))
  speech = unlist(stri_split(data$lines[[1]], 
                              regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)"))

To drop NA variable and combine like speakers (requires integrating into for loop):

df <-data.frame(person, speech)
df <- df[complete.cases(df), ]
df<- df %>%
  group_by(person) %>%
  summarise(speech= paste(speechsage, collapse = ","))

For loop that works for nothing:

for(i in 1:50){
  person  = c(NA, unlist(stri_extract_all(data$lines, 
                                          regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)")))
  message = unlist(stri_split(data$lines, 
                              regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)"))
}
df <- data.frame(person, speech)
error in data.frame(person, speech): arguments imply differing number of rows 927, 971

Moreover, I need to append the date and name variable to the final data frame which strikes me as easier to do inside of the for loop. ANY help on this is appreciated (especially if someone can show the complete steps needed to transform the data structure).

nj95 · Accepted Answer

This is one of those hard but ultimately easy questions in transcript parsing hell. I solved it with a few simple solutions in stringr, stringi, and tidyverse as Akash87 pointed out -- much of this answer took inspiration from his reply. I added another observation to the dataset for more diagnostics.

First, I create a dataframe out of the data:

df  <- data.frame(col_date = (c("2001-06-09", "2002-09-10", "2006-09-19")), 
       name = c("test1", "test2", "test3"), 
       col_lines = c("SPEAKER A. W. Hello this is a speech that I am giving. Blah. Blah. SPEAKER B. W. This is ALSO a line of a speech that I am giving.","This is procedural garbage. I hate garbage. Blah BLAH Blah. header. SPEAKER. T. I. I have a speech now.","procedural garbage. SPEAKER E. W. Wow, what a good speech. SPEAKER. T. I. yes. SPEAKER E. W. indeed. SPEAKER A. W. Hello this is a speech that I am giving. "))
df$speakers <- NA
df$speech <- NA

Then I save the regex pattern for easier access (don't mind the last pipe).

speakers_names  = "([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)|[[:upper:]]+([[:upper:]]\s[[:upper:]]{4}\.)"

I directly extract the speakers into the dataframe

df$speakers <- stri_extract_all(df$col_lines, regex= speakers_names)

Speeches are extracted by removing the names and excess data.

df$speech = str_remove(df$col_lines, ".*?(?=(([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\.+[ [:upper:].]+\.)|([[:upper:]]+[[:upper:]]{19,20}\.)|([[:upper:]]+([[:upper:]]\s[[:upper:]]{4}\.))))")
df$speech = stri_split(df$speech, regex= speakers_names)

This leaves us with a dataframe with nested lists. So we need to unnest them and shape up the data. Keep in mind that we need to get rid of the NA'd items to match our cases.

df <- data.frame(df %>% unnest(speakers), (df %>% unnest(speech) %>%
  na_if("") %>%
  na.omit()))
keeps <- c("col_date", "name", "speakers", "speech.1")
df <- df[ ,keeps, drop =FALSE]

Now all we have left is to concatenate speakers' speeches for a given date. I group by date, names, and speakers and summarise the strings.

df <- df %>% 
  group_by(col_date, name, speakers) %>%
  summarise(speech.1 = toString(speech.1))
View(df)

And we are done. What was a baffling Kafka-esque never-ending and grueling battle with wrangling this data into something workable, is solvable in 18 lines. If anyone has suggestions, let me know.

I hope that anyone who is parsing transcripts can use this question as a blueprint -- especially if they come from long and complicated JSONs. This works on most stenographic styles/languages as long as the regex is specific enough to capture the speakers' names (and this, of course, also applies to stage directions, etc). Cheers.

Regex in R: How to separate or omit groups of regex to parse a transcript? (Massive problem with unequal rows)

Answers (2)

Related Questions