Reputation: 99
I'm working with video transcript data. The data was automatically exported with line breaks in the middle of sentences. I'd like to combine the lines of each spoken passage into a single row. The data is formatted as follows:
data$transcript<-as.data.frame(c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
Intended output:
intendedData$transcript<-as.data.frame(c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!
Upvotes: 4
Views: 145
Reputation: 102469
> # s is assumed to be the transcript lines as a plain character vector
> unlist(unname(by(s, cumsum(grepl("-->", s)), \(x) c(x[1], paste0(x[-1], collapse = " ")))))
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
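To see why this works, it can help to look at the grouping step on its own. The following is a minimal sketch, assuming s holds the transcript lines as a plain character vector (the answer does not show its definition). It uses split and lapply in place of by, which should give the same result here: each timestamp line starts a new group, and everything after the first element of a group is pasted together.

```r
# Assumption: s is the transcript as a plain character vector
s <- c("00:00:03.990 --> 00:00:05.270",
       "<v Bill>I'm here to take some notes. I've",
       "heard this will be interesting.</v>",
       "00:00:05.770 --> 00:00:07.370",
       "<v Charlie>I believe you'll be correct",
       "about that, Bill.</v>",
       "00:00:10.810 --> 00:00:11.170",
       "<v Bill>Awesome.</v>")

# Each timestamp line bumps the counter, so a timestamp and the
# text lines that follow it share one group id
grp <- cumsum(grepl("-->", s))
grp
#> [1] 1 1 1 2 2 2 3 3

# Within each group: keep the timestamp, join the remaining lines
merged <- unlist(
  lapply(split(s, grp), \(x) c(x[1], paste(x[-1], collapse = " "))),
  use.names = FALSE
)
merged
```

Because the grouping depends only on the timestamp lines, this should also handle speech that is broken across more than two lines.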
Upvotes: 3
Reputation: 19191
An approach using strsplit and paste (same idea as @Allan Cameron, but different execution).
tmp <- trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]])
ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", tmp), tmp, paste0("<v ", tmp, "</v>"))
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
Without a temporary variable:
trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]]) |>
(\(x) ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", x), x, paste0("<v ", x, "</v>")))()
Upvotes: 5
Reputation: 79318
strsplit(paste(input, collapse = " "), ".(?=<v)|(?<=/v>).", perl=TRUE)[[1]]
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
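The pattern is easier to follow on a small string. The sketch below uses a made-up input x (hypothetical, not from the question): the first alternative consumes the single character that sits directly before <v, and the second consumes the single character that sits directly after /v>, so only the separating spaces are removed and the tags themselves survive.

```r
# Hypothetical toy input, standing in for the collapsed transcript
x <- "t1 <v A>hello there</v> t2 <v B>bye</v>"

# .(?=<v)   : any one character that is followed by "<v"
# (?<=/v>). : any one character that is preceded by "/v>"
parts <- strsplit(x, ".(?=<v)|(?<=/v>).", perl = TRUE)[[1]]
parts
#> [1] "t1"                   "<v A>hello there</v>" "t2"
#> [4] "<v B>bye</v>"
```

Note that this relies on exactly one separator character between pieces, which is what paste(..., collapse = " ") produces.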
Upvotes: 3
Reputation: 56219
The data looks like a broken version of the SRT format. We can fix the format, then use the dedicated package, srt.
# fix the format
ss <- split(transcript, cumsum(grepl("\\d+:\\d+:\\d+\\.\\d+", transcript)))
transcriptSRT <- unlist(lapply(seq_along(ss), \(i) c("", i, ss[[ i ]])))
write(transcriptSRT[ -1 ], "tmp.srt")
library(srt)
read_srt("tmp.srt", collapse = " ")
## A tibble: 3 × 4
#      n start   end subtitle
#  <int> <dbl> <dbl> <chr>
#1     1  3.99  5.27 <v Bill>I'm here to take some notes. I've heard this will be interesting.</v>
#2     2  5.77  7.37 <v Charlie>I believe you'll be correct about that, Bill.</v>
#3     3 10.8  11.2  <v Bill>Awesome.</v>
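For reference, the repair step can be inspected without writing a file. This is a sketch in base R only, assuming transcript is the original character vector of lines (before any collapsing): each block gets a blank separator line and a running index in front of its timestamp.

```r
# Assumption: transcript is the raw character vector of lines
transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

# One list element per subtitle block, each starting at a timestamp line
ss <- split(transcript, cumsum(grepl("\\d+:\\d+:\\d+\\.\\d+", transcript)))

# Prepend a blank line and the block index to every block
transcriptSRT <- unlist(lapply(seq_along(ss), \(i) c("", i, ss[[i]])))

# Print the repaired text, dropping the leading blank line
writeLines(transcriptSRT[-1])
```

The printed text has the index, timestamp, and text lines of each block separated by blank lines, which is the layout that is then written to tmp.srt.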
Upvotes: 4
Reputation: 7975
Some vectorised brute force using grepl and paste
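brute_merge = \(char) {
  stopifnot(is.character(char))
  # o: line opens a <v ...> tag; e: line contains the closing </v>
  o = grepl("^<v", char); e = grepl("</v>", char)
  # f: shift x left by k positions, padding the tail with fill
  f = \(x, k = 1, fill = NA) c(x[-seq(k)], rep(fill, k))
  # i: lines that open a tag, do not close it, and are closed by the next line
  i = which(o & !e & f(e))
  # nothing to merge: return early, since char[-integer(0)] would be empty
  if (!length(i)) return(char)
  char[i] = paste(char[i], char[i+1])
  char[-(i+1)]
}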
giving
> brute_merge(input)
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
Data input
input = c(
"00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"
)
Upvotes: 4
Reputation: 174378
You could paste the transcript together as a single long string, then use regular expressions to extract the timestamps and speech. Personally, I would want to keep these as distinct variables, but if you want you can interleave them together to give the desired output:
transcript <- c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>")
transcript <- paste(transcript, collapse = " ")
timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"
timestamps <- stringr::str_extract_all(transcript, timestamp_regex)[[1]]
speech <- stringr::str_extract_all(transcript, speech_regex)[[1]]
vctrs::vec_interleave(timestamps, speech)
#> [1] "00:00:03.990 --> 00:00:05.270"
#> [2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
#> [3] "00:00:05.770 --> 00:00:07.370"
#> [4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
#> [5] "00:00:10.810 --> 00:00:11.170"
#> [6] "<v Bill>Awesome.</v>"
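If the timestamps and speech are kept as distinct variables, as suggested above, one option is a data frame with one row per spoken passage. The following is a minimal sketch in base R rather than stringr; the helper extract_all and the column names start, end, speaker and text are illustrative choices, not part of the original answer, and it assumes every timestamp is followed by exactly one <v ...>...</v> block.

```r
transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")
one_string <- paste(transcript, collapse = " ")

# Hypothetical helper: all matches of a pattern in a single string
extract_all <- \(pattern, x) regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

timestamps <- extract_all("\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+", one_string)
speech     <- extract_all("<v .*?</v>", one_string)

# Assumes timestamps and speech have the same length (one block per cue);
# data.frame() raises an error if they do not
cues <- data.frame(
  start   = sub(" --> .*", "", timestamps),            # text before the arrow
  end     = sub(".* --> ", "", timestamps),            # text after the arrow
  speaker = sub("^<v ([^>]*)>.*", "\\1", speech),      # name inside the <v ...> tag
  text    = sub("^<v [^>]*>(.*)</v>$", "\\1", speech)  # speech without the tags
)
cues
```

Keeping the speaker and text in separate columns makes it straightforward to filter or count by speaker later, and the interleaved vector can still be rebuilt from timestamps and speech when needed.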
Upvotes: 5