Chris Ruehlemann

Reputation: 21410

How to add rows when time interval straddles minute mark

I have data with speech events timed from start to end in milliseconds:

df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0,20000,35000,65000,80000,100000,110000,140000),
  end = c(20000,35000,65000,80000,100000,110000,140000,195000)
)

I need to add a row whenever the start and end times of a row straddle a full-minute mark. The end time of the straddling row should be clipped to the exact full-minute mark (60000, 120000, etc.), and the added row should run from that mark to the original end value. The desired output would be:

  speaker  start    end
1    <NA>      0  20000
2       A  20000  35000
3    <NA>  35000  60000   # clipped end time
4    <NA>  60000  65000   # added row
5       B  65000  80000
6    <NA>  80000 100000
7       C 100000 110000
8    <NA> 110000 120000   # clipped end time
9    <NA> 120000 140000   # added row
10      C 140000 180000   # clipped end time
11      C 180000 195000   # added row

EDIT:

Here's my (long-winded) solution:

Step 1: Create a dataframe in which the minute-straddling row is flagged:

library(dplyr)

df0 <- df %>%
  mutate(
    minute_start = as.integer(start/60000),
    minute_end = as.integer(end/60000),
    straddler = minute_end > minute_start)

Step 2: Create another dataframe subsetted on the flagged rows, where the end value is mutated to the exact minute mark:

df1 <- df0 %>%
  filter(straddler) %>%
  mutate(end = minute_end*60000) 

Step 3: Create another dataframe subsetted on the flagged rows, where the start value is mutated to the exact minute mark:

df2 <- df0 %>%
  filter(straddler) %>%
  mutate(start = minute_end*60000)

Step 4: Create another dataframe without the flagged rows:

df3 <- df0 %>%
  filter(!straddler)

Step 5: Finally, rbind the dataframes, arranging them by start:

arrange(rbind(df1, df2, df3), start)
   speaker  start    end minute_start minute_end straddler
1     <NA>      0  20000            0          0     FALSE
2        A  20000  35000            0          0     FALSE
3     <NA>  35000  60000            0          1      TRUE
4     <NA>  60000  65000            0          1      TRUE
5        B  65000  80000            1          1     FALSE
6     <NA>  80000 100000            1          1     FALSE
7        C 100000 110000            1          1     FALSE
8     <NA> 110000 120000            1          2      TRUE
9     <NA> 120000 140000            1          2      TRUE
10       C 140000 180000            2          3      TRUE
11       C 180000 195000            2          3      TRUE

If anybody knows how to improve this, it'd be much appreciated!

Upvotes: 0

Views: 61

Answers (2)

ktiu

Reputation: 2626

Interesting question! I came up with this:

library(dplyr)
library(purrr)
library(tibble)

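# Number the minute (1-based) that each start and end falls in,
# then emit one row if both share a minute, otherwise two rows split at the mark.
mutate(df, across(-speaker, ~ floor(.x / 60000) + 1, .names = "mnt_{.col}")) %>%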
  pmap(\(...) {
    with(list(...), if (mnt_start == mnt_end) tibble(speaker, start, end)
                    else tibble(speaker = rep(speaker, 2),
                                start = c(start, mnt_start * 60000),
                                end = c(mnt_start * 60000, end)))
  }) %>%
  bind_rows()

Returning:

# A tibble: 11 x 3
   speaker  start    end
   <chr>    <dbl>  <dbl>
 1 NA           0  20000
 2 A        20000  35000
 3 NA       35000  60000
 4 NA       60000  65000
 5 B        65000  80000
 6 NA       80000 100000
 7 C       100000 110000
 8 NA      110000 120000
 9 NA      120000 140000
10 C       140000 180000
11 C       180000 195000
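
Note that the code above splits a straddling row exactly once, which is enough for the example data. If a single row could span several minute marks, or end exactly on one, a more general version might look like the following base R sketch. The function name `split_at_marks` and its `step` argument are made up for illustration, and the sketch assumes `start < end` in every row.

```r
df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0, 20000, 35000, 65000, 80000, 100000, 110000, 140000),
  end = c(20000, 35000, 65000, 80000, 100000, 110000, 140000, 195000)
)

# Hypothetical helper: split every row at each multiple of `step`
# that lies strictly between its start and end.
split_at_marks <- function(df, step = 60000) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    s <- df$start[i]
    e <- df$end[i]
    # First mark strictly after the start time
    first <- (floor(s / step) + 1) * step
    # All marks strictly inside (s, e); none if the first mark is already >= e
    inner <- if (first < e) seq(first, e, by = step) else numeric(0)
    inner <- inner[inner < e]
    # Consecutive cut points become the start/end pairs of the pieces
    cuts <- c(s, inner, e)
    data.frame(speaker = df$speaker[i],
               start = head(cuts, -1),
               end = tail(cuts, -1))
  })
  do.call(rbind, pieces)
}

result <- split_at_marks(df)
```

Because the marks are taken strictly inside each interval, a row that ends exactly on a minute mark is left as it is rather than producing a zero-length extra row.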

Upvotes: 1

user16164298

Reputation: 126

I'm very sorry about this - I don't have RStudio installed (new laptop), so I can't provide a code solution right now.

But one general approach to this is:

  • Take your end times, which are a vector of values.
  • Generate a vector of full-minute marks with seq, starting from 60000 and going up to your max end time, e.g. seq(60000, 195000, 60000).
  • Merge the two vectors, keep the unique values, and sort the result.
  • This is your new vector of end times. The start times follow directly: prepend 0 to this vector and drop its last value.
  • Now you'll have the list of times, and need to get the appropriate speakers.
  • You can use ifelse logic to reassign the speakers based on whether they're in between a certain start/end time in your original dataframe. For example, if start is between 35000 and 65000 speaker is NA.
  • And you're done.
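
The steps above could be sketched in base R roughly as follows. This is only an illustration under two assumptions: the rows are sorted by start and the intervals are contiguous (each end equals the next start), as in the example data. Instead of the ifelse logic suggested above, it uses `findInterval` to look up which original row each new start time falls into.

```r
df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0, 20000, 35000, 65000, 80000, 100000, 110000, 140000),
  end = c(20000, 35000, 65000, 80000, 100000, 110000, 140000, 195000)
)

# Full-minute marks from 60000 up to the largest end time
marks <- seq(60000, max(df$end), by = 60000)

# Merge the original end times with the marks, de-duplicate, and sort
new_end <- sort(unique(c(df$end, marks)))

# Start times: prepend the first start and drop the last end
new_start <- c(min(df$start), head(new_end, -1))

# For each new start, find the original row whose start is the
# largest one not exceeding it (assumes df is sorted by start)
idx <- findInterval(new_start, df$start)

result <- data.frame(speaker = df$speaker[idx],
                     start = new_start,
                     end = new_end)
```

If the original intervals had gaps between them, this approach would fill the gaps with rows, so it would need an extra filtering step.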

Upvotes: 0
