Reputation: 21410
I have data with speech events timed from start to end in milliseconds:
df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0, 20000, 35000, 65000, 80000, 100000, 110000, 140000),
  end = c(20000, 35000, 65000, 80000, 100000, 110000, 140000, 195000)
)
I need to add a row whenever the start and end times of a row straddle a full-minute mark. The end time of the straddling row should be clipped to the exact full-minute mark (60000, 120000, etc.), and the added row should start at that mark and take over the original end value. The desired output would be:
   speaker  start    end
1     <NA>      0  20000
2        A  20000  35000
3     <NA>  35000  60000 # clipped end time
4     <NA>  60000  65000 # added row
5        B  65000  80000
6     <NA>  80000 100000
7        C 100000 110000
8     <NA> 110000 120000 # clipped end time
9     <NA> 120000 140000 # added row
10       C 140000 180000 # clipped end time
11       C 180000 195000 # added row
EDIT:
Here's my (long-winded) solution:
Step 1: Create a dataframe in which the minute-straddling rows are flagged:
library(dplyr)
df0 <- df %>%
  mutate(
    minute_start = as.integer(start / 60000),
    minute_end = as.integer(end / 60000),
    straddler = minute_end > minute_start
  )
Step 2: Create another dataframe subsetted on the flagged rows, where the end value is mutated to the exact minute mark:
df1 <- df0 %>%
  filter(straddler) %>%
  mutate(end = minute_end * 60000)
Step 3: Create another dataframe subsetted on the flagged rows, where the start value is mutated to the exact minute mark:
df2 <- df0 %>%
  filter(straddler) %>%
  mutate(start = minute_end * 60000)
Step 4: Create another dataframe without the flagged rows:
df3 <- df0 %>%
  filter(!straddler)
Step 5: Finally, rbind the dataframes, arranging them by start:
arrange(rbind(df1, df2, df3), start)
   speaker  start    end minute_start minute_end straddler
1     <NA>      0  20000            0          0     FALSE
2        A  20000  35000            0          0     FALSE
3     <NA>  35000  60000            0          1      TRUE
4     <NA>  60000  65000            0          1      TRUE
5        B  65000  80000            1          1     FALSE
6     <NA>  80000 100000            1          1     FALSE
7        C 100000 110000            1          1     FALSE
8     <NA> 110000 120000            1          2      TRUE
9     <NA> 120000 140000            1          2      TRUE
10       C 140000 180000            2          3      TRUE
11       C 180000 195000            2          3      TRUE
If anybody knows how to improve this, it'd be much appreciated!
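For what it's worth, the five steps above can probably be collapsed into a few lines of base R. This is only a minimal sketch, assuming each row crosses at most one minute mark (as in the sample data). The names mark, straddler, clipped, added and result are made up for illustration.

```r
df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0, 20000, 35000, 65000, 80000, 100000, 110000, 140000),
  end = c(20000, 35000, 65000, 80000, 100000, 110000, 140000, 195000)
)

# Last full-minute mark at or before each end time
mark <- (df$end %/% 60000) * 60000
# A row straddles only if that mark lies strictly between start and end
straddler <- mark > df$start & mark < df$end

# Straddling rows with the end clipped to the mark
clipped <- df[straddler, ]
clipped$end <- mark[straddler]
# Added rows that start at the mark and keep the original end
added <- df[straddler, ]
added$start <- mark[straddler]

# Recombine with the untouched rows and sort by start
result <- rbind(df[!straddler, ], clipped, added)
result <- result[order(result$start), ]
rownames(result) <- NULL
result
```

On the sample data this gives the same 11 start/end pairs as the desired output. Because the test is strict on both sides, a row that ends exactly on a minute mark is not flagged here, whereas the as.integer() comparison in Step 1 would flag it and produce a zero-length added row.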
Upvotes: 0
Views: 61
Reputation: 2626
Interesting question! I came up with this:
library(dplyr)
library(purrr)
library(tibble)
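# Minute bucket (1-based) of each start and end, stored as mnt_start / mnt_end
mutate(df, across(-speaker, ~ floor(.x / 60000) + 1, .names = "mnt_{.col}")) %>%
  # Row by row: keep the row if both ends fall in the same minute, else split it
  pmap(\(...) {
    with(list(...),
         if (mnt_start == mnt_end) {
           tibble(speaker, start, end)
         } else {
           tibble(speaker = rep(speaker, 2),
                  start = c(start, mnt_start * 60000),
                  end = c(mnt_start * 60000, end))
         })
  }) %>%
  bind_rows()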
Returning:
# A tibble: 11 x 3
   speaker  start    end
   <chr>    <dbl>  <dbl>
 1 NA           0  20000
 2 A        20000  35000
 3 NA       35000  60000
 4 NA       60000  65000
 5 B        65000  80000
 6 NA       80000 100000
 7 C       100000 110000
 8 NA      110000 120000
 9 NA      120000 140000
10 C       140000 180000
11 C       180000 195000
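One thing the sample data doesn't exercise is a row that spans more than one minute mark. The code above splits such a row only once, at the first mark after its start. If that case can occur, something like the following base R sketch might help. split_at_minutes is a hypothetical helper written for this illustration, not part of any package, and the long row is made up.

```r
df <- data.frame(
  speaker = c(NA, "A", NA, "B", NA, "C", NA, "C"),
  start = c(0, 20000, 35000, 65000, 80000, 100000, 110000, 140000),
  end = c(20000, 35000, 65000, 80000, 100000, 110000, 140000, 195000)
)

# Hypothetical helper: split every row at each multiple of `unit`
# that lies strictly between its start and end.
split_at_minutes <- function(df, unit = 60000) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    start <- df$start[i]
    end <- df$end[i]
    # First minute mark strictly after `start`
    first <- (start %/% unit + 1) * unit
    marks <- if (first < end) seq(first, end, by = unit) else numeric(0)
    marks <- marks[marks < end]  # a mark equal to `end` needs no split
    breaks <- c(start, marks, end)
    # Consecutive break points become the start/end of each piece
    data.frame(
      speaker = df$speaker[i],
      start = head(breaks, -1),
      end = tail(breaks, -1),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

result <- split_at_minutes(df)

# A made-up row crossing three minute marks (60000, 120000, 180000)
long <- data.frame(speaker = "A", start = 50000, end = 190000)
long_split <- split_at_minutes(long)
```

On the question's data, result has the same 11 rows as the output above. The made-up long row comes back as four pieces: 50000-60000, 60000-120000, 120000-180000 and 180000-190000.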
Upvotes: 1
Reputation: 126
I'm very sorry about this - I don't have RStudio installed (new laptop), so I can't provide a code solution right now.
But one general approach to this is:
Upvotes: 0