Reputation: 113
I'm working on creating a custom transcript parser that only parses speakers from speeches. I've hit a roadblock in how to set this up in a for loop. The issue: I have over 50 rows with a column named "lines", this variable carries the data that I need to extract and append to the data already there. The data looks as follows:
col_date col_name_speech col_lines
Row 1: 2001-06-09 test1 "SPEAKER A. W. Hello this is a speech that I am giving. Blah.
Blah. SPEAKER B. W. This
is ALSO a line of a speech that I am giving."
Row 2: 2002-09-10 test2 "This is procedural garbage. I hate garbage. Blah BLAH Blah.
header. SPEAKER. T. I. I have a speech now."
Row 3: 2006-09-19 test3 "procedural garbage. SPEAKER E. W. Wow, what a good speech.
SPEAKER. T. I. yes. SPEAKER E. W. indeed."
The data frame that I would like to produce should look like this in the end:
date name speaker lines
2001-06-09 test1 SPEAKER A. W. Hello this is a speech that I am giving. Blah. Blah.
2001-06-09 test1 SPEAKER B. W. This is ALSO a line of a speech that I am giving
2002-09-10 test2 SPEAKER T. I. I have a speech now.
2006-09-19 test3 SPEAKER E. W. Wow, what a good speech. indeed.
2006-09-19 test3 SPEAKER. T. I. yes.
You should notice that the lines variable drops procedural garbage from the header of the row and combines the speakers' speeches.
I have a decent regex pattern to detect speakers and it works when applied to single rows, but it does not work in a for loop since the ultimate data frame has different row numbers in terms of speakers and speeches (which I attribute to the header problem). I should note the difference in row numbers is almost the same amount as the number of rows. I will provide both sets of code.
Works to separate speakers one row at a time:
person = c(NA, unlist(stri_extract_all(data$lines[[1]],
regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)")))
speech = unlist(stri_split(data$lines[[1]],
regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)"))
To drop NA variable and combine like speakers (requires integrating into for loop):
df <-data.frame(person, speech)
df <- df[complete.cases(df), ]
df<- df %>%
group_by(person) %>%
summarise(speech= paste(speechsage, collapse = ","))
For loop that works for nothing:
for(i in 1:50){
person = c(NA, unlist(stri_extract_all(data$lines,
regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)")))
message = unlist(stri_split(data$lines,
regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)"))
}
df <- data.frame(person, speech)
error in data.frame(person, speech): arguments imply differing number of rows 927, 971
Moreover, I need to append the date and name variable to the final data frame which strikes me as easier to do inside of the for loop. ANY help on this is appreciated (especially if someone can show the complete steps needed to transform the data structure).
Upvotes: 0
Views: 92
Reputation: 113
This is one of those hard but ultimately easy questions in transcript parsing hell. I solved it with a few simple solutions in stringr
, stringi
, and tidyverse
as Akash87 pointed out -- much of this answer took inspiration from his reply. I added another observation to the dataset for more diagnostics.
First, I create a dataframe out of the data:
df <- data.frame(col_date = (c("2001-06-09", "2002-09-10", "2006-09-19")),
name = c("test1", "test2", "test3"),
col_lines = c("SPEAKER A. W. Hello this is a speech that I am giving. Blah. Blah. SPEAKER B. W. This is ALSO a line of a speech that I am giving.","This is procedural garbage. I hate garbage. Blah BLAH Blah. header. SPEAKER. T. I. I have a speech now.","procedural garbage. SPEAKER E. W. Wow, what a good speech. SPEAKER. T. I. yes. SPEAKER E. W. indeed. SPEAKER A. W. Hello this is a speech that I am giving. "))
df$speakers <- NA
df$speech <- NA
Then I save the regex pattern for easier access (don't mind the last pipe).
speakers_names = "([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)|[[:upper:]]+([[:upper:]]\\s[[:upper:]]{4}\\.)"
I directly extract the speakers into the dataframe
df$speakers <- stri_extract_all(df$col_lines, regex= speakers_names)
Speeches are extracted by removing the names and excess data.
df$speech = str_remove(df$col_lines, ".*?(?=(([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)|([[:upper:]]+([[:upper:]]\\s[[:upper:]]{4}\\.))))")
df$speech = stri_split(df$speech, regex= speakers_names)
This leaves us with a dataframe with nested lists. So we need to unnest them and shape up the data. Keep in mind that we need to get rid of the NA'd items to match our cases.
df <- data.frame(df %>% unnest(speakers), (df %>% unnest(speech) %>%
na_if("") %>%
na.omit()))
keeps <- c("col_date", "name", "speakers", "speech.1")
df <- df[ ,keeps, drop =FALSE]
Now all we have left is to concatenate speakers' speeches for a given date. I group by date, names, and speakers and summarise the strings.
df <- df %>%
group_by(col_date, name, speakers) %>%
summarise(speech.1 = toString(speech.1))
View(df)
And we are done. What was a baffling Kafka-esque never-ending and grueling battle with wrangling this data into something workable, is solvable in 18 lines. If anyone has suggestions, let me know.
I hope that anyone who is parsing transcripts can use this question as a blueprint -- especially if they come from long and complicated JSONs. This works on most stenographic styles/languages as long as the regex is specific enough to capture the speakers' names (and this, of course, also applies to stage directions, etc). Cheers.
Upvotes: 1
Reputation: 3994
Wow this was a doozie! I used a few different libraries tidyverse
,stringr
, and stringi
:
First I create the data set in an R friendly format:
data <- data.frame(col_date = as.Date(c("2001-06-09", "2002-09-10", "2006-09-19")),
name = c("test1", "test2", "test3"),
col_lines = c("SPEAKER A. W. Hello this is a speech that I am giving. Blah. Blah. SPEAKER B. W. This is ALSO a line of a speech that I am giving.","This is procedural garbage. I hate garbage. Blah BLAH Blah. header. SPEAKER. T. I. I have a speech now.","procedural garbage. SPEAKER E. W. Wow, what a good speech. SPEAKER. T. I. yes. SPEAKER E. W. indeed."))
Then I identify the speakers for each col_lines
, and get a unique list of speakers for substringing later.
speakers <- sapply(lapply(data$col_lines,
function(x) stri_extract_all(x, regex="([[:upper:]]+[ [:upper:]]+[ [:upper:].]+\\.+[ [:upper:].]+\\.)|([[:upper:]]+[[:upper:]]{19,20}\\.)")), unlist)
speak_unique <- unique(unlist(speakers)) #GET UNIQUE SPEAKER LIST FOR
speak_mat <- plyr::ldply(speakers, rbind)
I bind the result of speak_mat
to my original data and make sure I name the columns for future use. The line used to generate speak_mat
is used to go from list to matrix, filling blanks with NA
in generation order.
data_new <- cbind(data, speak_mat)
colnames(data_new) <- c(names(data), paste("speaker_", names(speak_mat)))
Now I replace the speak_unique
in each col_line
with a single character representation
data_new_lines <- data_new %>%
mutate( col_lines_new = gsub(paste(speak_unique, collapse = "|"), ";", col_lines),
col_lines_new1 = gsub("[^;]*;(.*)", "\\1", col_lines_new))
I then get rid of trailing and starting whitespace for when a speak_unique
was the first or last character, and remove all text for each col_lines
that comes before the first speak_unique
. I also break apart each text at ;
and create a vector.
out <- unlist(lapply(strsplit(as.character(data_new_lines$col_lines_new1),";"), function(x) {x[!x == ""]}))
I take the out
and combine it with the long format of data_new
that does not have the original col_lines
.
data_long <- data_new %>%
select(-col_lines) %>%
gather(num, speaker, -col_date, -name, na.rm = T) %>%
arrange(col_date, num) %>%
select(-num) %>%
cbind(lines = out) %>%
group_by(col_date, name, speaker) %>%
summarise(lines_c = paste(trimws(lines), collapse = " "))
Hope this helps!
Upvotes: 0