Chris Ruehlemann

Reputation: 21400

How to sum up durations if certain patterns are found across columns

I have a dataframe with words and their durations in speech:

test1
       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.064 0.075 0.095 0.367 And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.087 0.088 0.264 0.092 0.249   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't

The short form n't is treated as a separate word. That's fine as long as the preceding word ends in a consonant, such as did, but not if the preceding word ends in a vowel, such as do or ca. Because splitting these forms into separate words is incorrect, splitting their durations is incorrect too.

What I'd like to do is sum up the durations of ca and n't as well as do and n't, but leave the separate durations for did and n't alone.

I know how to select the rows where the changes need to be implemented:

test1[which(grepl("(?<=(ca|do)\\s)n't", apply(test1, 1, paste0, collapse = " "), perl = TRUE)),]

but I'm stuck going forward.
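Going one step beyond selecting rows, here is a rough base R sketch that also locates the column where each affected n't sits. It only rebuilds the word columns from the reproducible data; the names words, prev, hit and pos are made up for illustration.

```r
# Word columns only, copied from the question's reproducible data
words <- data.frame(
  w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"),
  w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"),
  w5 = c("no", "you", "n't"), w6 = c("Sarah", "ca", "touch"),
  w7 = c("do", "n't", "it"), w8 = c("n't", "see", "I"),
  w9 = c("do", "his", "did"), w10 = c("it", "head", "n't"),
  row.names = c("10", "132", "784"), stringsAsFactors = FALSE
)

w <- as.matrix(words)

# Same matrix shifted one column to the right:
# prev[i, j] holds the word that precedes w[i, j] (NA for the first column)
prev <- cbind(NA_character_, w[, -ncol(w), drop = FALSE])

# TRUE where "n't" directly follows "ca" or "do"
hit <- w == "n't" & matrix(prev %in% c("ca", "do"), nrow = nrow(w))

# Row/column indices of the affected cells
pos <- which(hit, arr.ind = TRUE)
pos
```

With the sample data this should flag w7 in row 132 and w8 in row 10, and nothing in row 784, since did is not in the list of preceding words.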

The desired result would look like this:

       d1    d2    d3    d4    d5    d6    d7    d8    d9   d10  w1       w2      w3  w4  w5    w6  w7  w8  w9  w10
10  0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367    NA And        I thought  oh  no Sarah  do n't  do   it
132 0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249    NA   I       du       n  no you    ca n't see his head
784 0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099 0.098  Oh honestly       I did n't touch  it   I did  n't
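For reference, the two changed values in the desired result are simply the sums of the adjacent durations in the original data. A quick check, with the numbers copied from rows 10 and 132 (the variable names are only for illustration):

```r
# Row 10: "do" (d7) + "n't" (d8)
do_nt <- 0.064 + 0.075

# Row 132: "ca" (d6) + "n't" (d7)
ca_nt <- 0.087 + 0.088

round(c(do_nt, ca_nt), 3)
```

The remaining durations in those rows shift one column to the left, which is why d10 becomes NA.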

How can this be done? Help is much appreciated.

Reproducible data:

test1 <- structure(list(d1 = c(0.103, 0.091, 0.152), d2 = c(0.168, 0.072, 
                   0.341), d3 = c(0.198, 0.109, 0.117), d4 = c(0.188, 0.119, 0.108
                   ), d5 = c(0.359, 0.113, 0.123), d6 = c(0.343, 0.087, 0.263), 
                   d7 = c(0.064, 0.088, 0.083), d8 = c(0.075, 0.264, 0.095), 
                   d9 = c(0.095, 0.092, 0.099), d10 = c(0.367, 0.249, 0.098), 
                   w1 = c("And", "I", "Oh"), w2 = c("I", "du", "honestly"), 
                   w3 = c("thought", "n", "I"), w4 = c("oh", "no", "did"), w5 = c("no", 
                   "you", "n't"), w6 = c("Sarah", "ca", "touch"), w7 = c("do", 
                   "n't", "it"), w8 = c("n't", "see", "I"), w9 = c("do", "his", 
                   "did"), w10 = c("it", "head", "n't")), row.names = c(10L, 
                   132L, 784L), class = "data.frame")

Upvotes: 1

Views: 51

Answers (1)

lroha

Reputation: 34376

I think this is best done with data in long instead of wide format so you can take advantage of grouping operations:

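library(dplyr)
library(tidyr)
library(tibble)

test1 %>%
  rownames_to_column() %>%
  # long format: one row per word, with columns d (duration) and w (word)
  pivot_longer(-rowname, names_to = c(".value", "number"), names_pattern = "(\\D)(\\d+)") %>%
  group_by(rowname) %>%
  # wid does not increase when "n't" follows "ca" or "do", so the pair shares an id
  mutate(wid = cumsum(!(lag(w) %in% c("ca", "do") & w == "n't"))) %>%
  group_by(rowname, wid) %>%
  # within each id, sum the durations and paste the words together
  summarise(d = sum(d),
            w = paste0(w, collapse = "")) %>%
  # back to wide format; rows that lost a word are padded with NA
  pivot_wider(names_from = wid, values_from = c(d, w), names_sep = "")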

`summarise()` regrouping output by 'rowname' (override with `.groups` argument)
# A tibble: 3 x 21
# Groups:   rowname [3]
  rowname    d1    d2    d3    d4    d5    d6    d7    d8    d9    d10 w1    w2       w3      w4    w5    w6    w7    w8    w9    w10  
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <chr> <chr>    <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10      0.103 0.168 0.198 0.188 0.359 0.343 0.139 0.095 0.367 NA     And   I        thought oh    no    Sarah don't do    it    NA   
2 132     0.091 0.072 0.109 0.119 0.113 0.175 0.264 0.092 0.249 NA     I     du       n       no    you   can't see   his   head  NA   
3 784     0.152 0.341 0.117 0.108 0.123 0.263 0.083 0.095 0.099  0.098 Oh    honestly I       did   n't   touch it    I     did   n't  
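To see what the wid step does, here is a small base R sketch on row 10 alone. It is not part of the pipeline above, just an illustration of the same cumsum idea; prev stands in for dplyr's lag(w), and glue, d_merged, w_merged and d_wide are made-up names.

```r
# Row 10 of the question's data
w <- c("And", "I", "thought", "oh", "no", "Sarah", "do", "n't", "do", "it")
d <- c(0.103, 0.168, 0.198, 0.188, 0.359, 0.343, 0.064, 0.075, 0.095, 0.367)

# Previous word for each position (NA for the first word)
prev <- c(NA, head(w, -1))

# TRUE where "n't" should be glued onto the preceding "ca" or "do"
glue <- prev %in% c("ca", "do") & w == "n't"

# The id increases by one for every word except a glued "n't",
# so "do" and its "n't" end up with the same id
wid <- cumsum(!glue)

# Sum durations and paste words within each id
d_merged <- as.numeric(tapply(d, wid, sum))
w_merged <- as.character(tapply(w, wid, paste0, collapse = ""))

# Pad with NA to restore the original width
d_wide <- c(d_merged, rep(NA_real_, length(d) - length(d_merged)))

wid
w_merged
d_wide
```

Here wid repeats the value 7 for do and n't, so the two durations are summed into a single slot and the last position is filled with NA, matching the first row of the output above.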

Upvotes: 1
