Courtney Gerver

Reputation: 99

How to combine rows, separated by returns, that start and end with specific characters?

I'm working with video transcript data. The data was automatically exported with a return mid-sentence. I'd like to combine the spoken lines into a single row. The data is formatted as such:

data <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've",
 "heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct",
 "about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

Intended output:

intendedData <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!

Upvotes: 4

Views: 145

Answers (6)

ThomasIsCoding

Reputation: 102469

> unlist(unname(by(s, cumsum(grepl("-->", s)), \(x) c(x[1], paste0(x[-1], collapse = " ")))))
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
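
For readers who want to see the steps separately, here is a rough base R sketch of the same grouping idea. It assumes `s` is the character vector of transcript lines (the one-liner above does not show its definition) and that the first line is a timestamp. It uses `split()` and `lapply()` instead of `by()`, so it is an unpacked variant rather than the exact code above.

```r
# Assumption: `s` holds the raw transcript lines as a character vector.
s <- c(
  "00:00:03.990 --> 00:00:05.270",
  "<v Bill>I'm here to take some notes. I've",
  "heard this will be interesting.</v>",
  "00:00:05.770 --> 00:00:07.370",
  "<v Charlie>I believe you'll be correct",
  "about that, Bill.</v>",
  "00:00:10.810 --> 00:00:11.170",
  "<v Bill>Awesome.</v>"
)

# Every timestamp line opens a new cue, so the running count of
# timestamp lines works as a group id for each line.
cue_id <- cumsum(grepl("-->", s, fixed = TRUE))

# One chunk per cue: the first element is the timestamp, the rest
# are the pieces of the spoken line.
cues <- split(s, cue_id)

# Keep the timestamp as is and paste the remaining pieces together.
merged <- unlist(
  lapply(cues, function(x) c(x[1], paste(x[-1], collapse = " "))),
  use.names = FALSE
)
merged
```

Because the grouping only depends on where the timestamps are, this should also cope with spoken lines that wrap over more than two rows.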

Upvotes: 3

Andre Wildberg

Reputation: 19191

An approach using strsplit and paste (the same idea as @Allan Cameron's, but executed differently).

tmp <- trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]])

ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", tmp), tmp, paste0("<v ", tmp, "</v>"))
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"

Without a temporary variable:

trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]]) |> 
  (\(x) ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", x), x, paste0("<v ", x, "</v>")))()

Upvotes: 5

Onyambu

Reputation: 79318

strsplit(paste(input, collapse = " "), ".(?=<v)|(?<=/v>).", perl=TRUE)[[1]]

[1] "00:00:03.990 --> 00:00:05.270"                                                
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"                                                
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"                 
[5] "00:00:10.810 --> 00:00:11.170"                                                
[6] "<v Bill>Awesome.</v>" 

Upvotes: 3

zx8754

Reputation: 56219

The data looks like a broken version of the SRT format. We can fix the format and then use the dedicated srt package.

# fix the format
ss <- split(transcript, cumsum(grepl("\\d+:\\d+:\\d+\\.\\d+", transcript)))
transcriptSRT <- unlist(lapply(seq_along(ss), \(i) c("", i, ss[[ i ]])))
write(transcriptSRT[ -1 ], "tmp.srt")

library(srt)
read_srt("tmp.srt", collapse = " ")

## A tibble: 3 × 4
#      n start   end subtitle                                                                     
#  <int> <dbl> <dbl> <chr>                                                                        
#1     1  3.99  5.27 <v Bill>I'm here to take some notes. I've heard this will be interesting.</v>
#2     2  5.77  7.37 <v Charlie>I believe you'll be correct about that, Bill.</v>                 
#3     3 10.8  11.2  <v Bill>Awesome.</v>

Upvotes: 4

Friede

Reputation: 7975

Some vectorised brute force using grepl and paste

brute_merge = \(char) {
  stopifnot(is.character(char))
  o = grepl("^<v", char); e = grepl("</v>", char)
  f = \(x, k = 1, fill = NA) c(x[-seq(k)], rep(fill, k))
  i = which(o & !e & f(e))
  if (!length(i)) return(char)  # nothing to merge; char[-(i+1)] would otherwise drop every element
  char[i] = paste(char[i], char[i+1])
  char[-(i+1)]
}

giving

> brute_merge(input)
[1] "00:00:03.990 --> 00:00:05.270"                                                
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"                                                
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"                 
[5] "00:00:10.810 --> 00:00:11.170"                                                
[6] "<v Bill>Awesome.</v>"    

Data input

input = c(
  "00:00:03.990 --> 00:00:05.270",
  "<v Bill>I'm here to take some notes. I've",
  "heard this will be interesting.</v>",
  "00:00:05.770 --> 00:00:07.370",
  "<v Charlie>I believe you'll be correct",
  "about that, Bill.</v>",
  "00:00:10.810 --> 00:00:11.170",
  "<v Bill>Awesome.</v>"
)
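
The function above appears to merge exactly one continuation row per spoken line. If a line may wrap over three or more rows, one possible sketch is to open a new group at every timestamp row or `<v` row and paste each group together. `merge_wrapped` and the `wrapped` data are hypothetical names made up for illustration, and the sketch assumes every spoken line starts with `<v`.

```r
# Hypothetical helper (not part of the answer above): start a new group at
# every timestamp row or "<v" row, then paste each group into one string.
merge_wrapped <- function(char) {
  stopifnot(is.character(char))
  starts_group <- grepl("-->", char, fixed = TRUE) | startsWith(char, "<v")
  groups <- split(char, cumsum(starts_group))
  vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

# Made-up example with one spoken line wrapped over three rows
wrapped <- c(
  "00:00:03.990 --> 00:00:05.270",
  "<v Bill>I'm here to take",
  "some notes. I've heard",
  "this will be interesting.</v>",
  "00:00:10.810 --> 00:00:11.170",
  "<v Bill>Awesome.</v>"
)
merge_wrapped(wrapped)
```

Rows that match neither pattern are treated as continuations of whatever came before them, so a malformed file would be merged silently rather than flagged.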

Upvotes: 4

Allan Cameron

Reputation: 174378

You could paste the transcript together as a single long string, then use regular expressions to extract the timestamps and speech. Personally, I would want to keep these as distinct variables, but if you want you can interleave them together to give the desired output:

transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

transcript <- paste(transcript, collapse = " ")
timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"

timestamps <- stringr::str_extract_all(transcript, timestamp_regex)[[1]]
speech <- stringr::str_extract_all(transcript, speech_regex)[[1]]

vctrs::vec_interleave(timestamps, speech)
#> [1] "00:00:03.990 --> 00:00:05.270"                                                
#> [2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
#> [3] "00:00:05.770 --> 00:00:07.370"                                                
#> [4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"                 
#> [5] "00:00:10.810 --> 00:00:11.170"                                                
#> [6] "<v Bill>Awesome.</v>" 
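
If you prefer to keep the timestamps and speech as distinct variables, as suggested above, a base R sketch without stringr or vctrs might look like the following. The column names (`start`, `end`, `speaker`, `text`) are my own choices for illustration, and the sketch assumes timestamps and `<v ...>...</v>` blocks strictly alternate, so both extractions return the same number of matches.

```r
transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

one_string <- paste(transcript, collapse = " ")
timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"

# Base R counterpart of str_extract_all(): find all matches, then pull them out
timestamps <- regmatches(one_string,
                         gregexpr(timestamp_regex, one_string, perl = TRUE))[[1]]
speech <- regmatches(one_string,
                     gregexpr(speech_regex, one_string, perl = TRUE))[[1]]

# One row per cue, with the pieces split into separate columns
cues <- data.frame(
  start   = sub(" --> .*$", "", timestamps),
  end     = sub("^.* --> ", "", timestamps),
  speaker = sub("^<v ([^>]+)>.*$", "\\1", speech),
  text    = sub("^<v [^>]+>(.*)</v>$", "\\1", speech),
  stringsAsFactors = FALSE
)
cues
```

If the two extractions ever return different lengths, `data.frame()` will typically raise an error, which is a useful signal that the file does not follow the expected pattern.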

Upvotes: 5
