Reputation: 21432
I have transcriptions of speech with timestamps:
df
   line speaker utterance                                timestamp
1  0001 ID16.1  ah-ha                                    00:00:07.060 - 00:00:07.660
3  0002 <NA>    yes                                      00:00:07.964 - 00:00:08.610
5  0003 <NA>    okay so where do we know each other from 00:00:16.350 - 00:00:22.170
7  0004 ID16.2  U uh Upper Rhine Cruises? maybe?         00:00:23.400 - 00:00:26.600
9  0005 ID16.3  yeah? ((pause)) well I do n't-           00:00:26.305 - 00:00:28.210
11 0006 ID16.1  (...) Meg?                               00:00:27.385 - 00:00:29.305
13 0007 <NA>    do you know Meg?                         00:00:29.100 - 00:00:33.879
I need to do two things whenever speaker is NA: (i) append the string in column utterance to the utterance in the prior row, and (ii) merge the two timestamps accordingly, so that the merged row runs from the start of the first timestamp to the end of the last one.
The desired outcome is this:
df
  line speaker utterance                                          timestamp
1 0001 ID16.1  ah-ha yes okay so where do we know each other from 00:00:07.060 - 00:00:22.170
3 0004 ID16.2  U uh Upper Rhine Cruises? maybe?                   00:00:23.400 - 00:00:26.600
5 0005 ID16.3  yeah? ((pause)) well I do n't-                     00:00:26.305 - 00:00:28.210
7 0006 ID16.1  (...) Meg? do you know Meg?                        00:00:27.385 - 00:00:33.879
I've been trying to solve the problem using paste0, dplyr::lag, and dplyr::lead, but have not gotten far.
Reproducible data:
df <- structure(list(line = c("0001", "0002", "0003", "0004", "0005",
"0006", "0007"), speaker = c("ID16.1", NA, NA, "ID16.2",
"ID16.3", "ID16.1", NA), utterance = c("ah-ha", "yes",
"okay so where do we know each other from",
"U uh Upper Rhine Cruises? maybe? ", "yeah? ((pause)) well I do n't-",
"(...) Meg?", "do you know Meg?"
), timestamp = c("00:00:07.060 - 00:00:07.660", "00:00:07.964 - 00:00:08.610",
"00:00:16.350 - 00:00:22.170", "00:00:23.400 - 00:00:26.600",
"00:00:26.305 - 00:00:28.210", "00:00:27.385 - 00:00:29.305",
"00:00:29.100 - 00:00:33.879")), row.names = c(1L, 3L, 5L, 7L,
9L, 11L, 13L), class = "data.frame")
Upvotes: 0
Views: 25
Reputation: 160607
Try dplyr::group_by. FYI, your displayed data is different from your df, which changes the aggregation.
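library(dplyr)

df %>%
  # every row with a speaker starts a new group; NA rows join the preceding group
  group_by(notna = cumsum(!is.na(speaker))) %>%
  summarize(
    line = first(line),
    speaker = first(speaker),
    utterance = paste(utterance, collapse = " "),
    # split each "start - end" pair, then keep the first start and the last end
    timestamp = paste(unlist(strsplit(timestamp, "[- ]+"))[c(1, n()*2)], collapse = " - "),
    .groups = "drop"
  ) %>%
  select(-notna)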
# # A tibble: 4 x 4
#   line  speaker utterance                                            timestamp
#   <chr> <chr>   <chr>                                                <chr>
# 1 0001  ID16.1  "ah-ha yes okay so where do we know each other from" 00:00:07.060 - 00:00:22.170
# 2 0004  ID16.2  "U uh Upper Rhine Cruises? maybe? "                  00:00:23.400 - 00:00:26.600
# 3 0005  ID16.3  "yeah? ((pause)) well I do n't-"                     00:00:26.305 - 00:00:28.210
# 4 0006  ID16.1  "(...) Meg? do you know Meg?"                        00:00:27.385 - 00:00:33.879
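In case it helps to see why the cumsum() key works, here is a minimal base R sketch of the same idea without dplyr. It is only an illustration under the assumption that every timestamp has the form "start - end" and that rows are already in chronological order within each speaker turn; the names grp, t_start, t_end, first_rows and merged are mine, not from the question.

```r
# Data copied from the question's reproducible example
df <- data.frame(
  line = c("0001", "0002", "0003", "0004", "0005", "0006", "0007"),
  speaker = c("ID16.1", NA, NA, "ID16.2", "ID16.3", "ID16.1", NA),
  utterance = c("ah-ha", "yes", "okay so where do we know each other from",
                "U uh Upper Rhine Cruises? maybe? ",
                "yeah? ((pause)) well I do n't-", "(...) Meg?",
                "do you know Meg?"),
  timestamp = c("00:00:07.060 - 00:00:07.660", "00:00:07.964 - 00:00:08.610",
                "00:00:16.350 - 00:00:22.170", "00:00:23.400 - 00:00:26.600",
                "00:00:26.305 - 00:00:28.210", "00:00:27.385 - 00:00:29.305",
                "00:00:29.100 - 00:00:33.879"),
  stringsAsFactors = FALSE
)

# The counter increases only on rows that have a speaker,
# so each NA row inherits the group number of the row above it
grp <- cumsum(!is.na(df$speaker))

# Split "start - end" into a two-column character matrix
parts <- do.call(rbind, strsplit(df$timestamp, " - ", fixed = TRUE))
t_start <- parts[, 1]
t_end <- parts[, 2]

# The first row of each group supplies line and speaker
first_rows <- !duplicated(grp)

merged <- data.frame(
  line = df$line[first_rows],
  speaker = df$speaker[first_rows],
  # concatenate all utterances within a group
  utterance = as.vector(tapply(df$utterance, grp, paste, collapse = " ")),
  # first start and last end within a group
  timestamp = paste(
    as.vector(tapply(t_start, grp, function(x) x[1])),
    as.vector(tapply(t_end, grp, function(x) x[length(x)])),
    sep = " - "
  ),
  stringsAsFactors = FALSE
)

print(grp)
print(merged)
```

Here grp comes out as 1 1 1 2 3 4 4, which is exactly the grouping the dplyr version builds in its notna column. Note that both versions assume the first row has a non-NA speaker; a leading NA row would end up in its own group 0 with an NA speaker.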
Upvotes: 1