I have a dataframe with words and their durations in speech:
test1
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
10 0.103 0.168 0.198 0.188 0.359 0.343 0.064 0.075 0.095 0.367 And I thought oh no Sarah do n't do it
132 0.091 0.072 0.109 0.119 0.113 0.087 0.088 0.264 0.092 0.249 I du n no you ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098 Oh honestly I did n't touch it I did n't
The short form "n't" is treated as if it were a separate word. That's okay as long as the preceding word ends in a consonant, such as "did", but it's not okay if the preceding word ends in a vowel, such as "do" or "ca". Because that separation into different words is incorrect, the separation into different durations is incorrect too.
What I'd like to do is sum up the durations of "ca" and "n't" as well as "do" and "n't", but leave alone the separate durations for "did" and "n't".
I know how to select the rows where the changes need to be implemented:
test1[grepl("(?<=(ca|do)\\s)n't", apply(test1, 1, paste0, collapse = " "), perl = TRUE), ]
but I'm stuck going forward.
The desired result would look like this:
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
10 0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367 NA And I thought oh no Sarah do n't do it
132 0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249 NA I du n no you ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098 Oh honestly I did n't touch it I did n't
How can this be done? Help is much appreciated.
Reproducible data:
test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072,
0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263),
d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095),
d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098),
w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"),
w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no",
"you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do",
"n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his",
"did"), w10 = c("it", "head", "n't")), row.names = c(10L,
132L, 784L), class = "data.frame")
I think this is best done with data in long instead of wide format so you can take advantage of grouping operations:
library(dplyr)
library(tidyr)
library(tibble)
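test1 %>%
  # keep the original row names as a column so they can serve as a grouping key
  rownames_to_column() %>%
  # long format: one row per word, with columns d (duration), w (word), number (position)
  pivot_longer(-rowname, names_to = c(".value", "number"), names_pattern = "(\\D)(\\d+)") %>%
  group_by(rowname) %>%
  # running word id: "n't" directly after "ca" or "do" keeps the previous word's id
  mutate(wid = cumsum(!(lag(w) %in% c("ca", "do") & w == "n't"))) %>%
  group_by(rowname, wid) %>%
  # sum the durations and paste the words that share an id
  summarise(d = sum(d),
            w = paste0(w, collapse = "")) %>%
  # back to wide format, with columns named d1, d2, ..., w1, w2, ...
  pivot_wider(names_from = wid, values_from = c(d, w), names_sep = "")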
`summarise()` regrouping output by 'rowname' (override with `.groups` argument)
# A tibble: 3 x 21
# Groups: rowname [3]
rowname d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10 0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367 NA And I thought oh no Sarah don't do it NA
2 132 0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249 NA I du n no you can't see his head NA
3 784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098 Oh honestly I did n't touch it I did n't
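The key step is the wid column, so here is a minimal sketch of how it is built, using only the words and durations from row 10. To stay free of package dependencies it uses base R: prev is a stand-in for dplyr's lag(), tapply() stands in for the grouped summarise(), and the names prev, merge_with_prev, d_merged and w_merged are made up for this illustration.

```r
# Words and durations from row 10 of test1 (columns w1..w10 and d1..d10)
w <- c("And", "I", "thought", "oh", "no", "Sarah", "do", "n't", "do", "it")
d <- c(0.103, 0.168, 0.198, 0.188, 0.359, 0.343, 0.064, 0.075, 0.095, 0.367)

# Base-R stand-in for dplyr::lag(): the previous word, NA for the first one
prev <- c(NA, head(w, -1))

# TRUE where "n't" directly follows "ca" or "do", i.e. where it should merge
merge_with_prev <- prev %in% c("ca", "do") & w == "n't"

# Every word that does NOT merge starts a new group, so cumsum() of the
# negation is a running word id; a merging "n't" repeats the previous id
wid <- cumsum(!merge_with_prev)
# wid: 1 2 3 4 5 6 7 7 8 9

# Summing durations and pasting words within each id gives the merged values
d_merged <- as.vector(tapply(d, wid, sum))
w_merged <- as.vector(tapply(w, wid, paste0, collapse = ""))
```

Because "do" and "n't" share id 7, their durations are summed to 0.139 and the words are pasted into "don't". The row then has only nine ids, which is why d10 and w10 come out as NA after pivoting back to wide format.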