Chris Ruehlemann

Reputation: 21442

Summarize grouped character data with true NA in dplyr

I have speech data in which I want to collapse consecutive utterances by the same speaker:

df <- structure(list(Line = 1:7,
                     Speaker = c("ID01.A", NA, "ID01.C", "ID01.C", "ID01.A", "ID01.A", "ID01.A"), 
                     Utterance = c("how old's your mom¿", 
                                   "(0.855)", 
                                   "eh six:ty:::-one=", 
                                   "[when was] that¿=", 
                                   "[yes]",                               # collapse with ...
                                   "(0.163)",                             # ... this and ...
                                   "=!this! was on °Wednesday°"),         # ... that
                     Sequ = c(1L, 1L,1L, 2L, 2L, 2L, 2L), 
                     c7 = c("how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1", 
                            NA, 
                            "eh_UH sixty-one_MC", 
                            "when_RRQ was_VBDZ that_DD1", 
                            "yes_UH",                                     # collapse with ...
                            NA,                                           # ... and ...
                            "this_DD1 was_VBDZ on_II Wednesday_NPD1"),    # ... that
                     N_c7 = c(12L,NA, 2L, 3L, 1L, NA, 4L)), 
                row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

I'm doing well as far as summarizing/collapsing Utterance and N_c7 is concerned. Only column c7 poses a problem, namely the presence of a (true) NA in Line 6. This NA breaks the summarize operation for c7, unless I first convert true NA to character "NA":

library(dplyr)
library(data.table)  # for rleid()
library(stringr)     # for str_c()
df %>%
  mutate(
    # convert true NA to character "NA":
    c7 = ifelse(is.na(c7), "NA", c7)
  ) %>%
  # group:
  group_by(grp = rleid(Speaker, Sequ)) %>%
  # summarise:
  summarise(
    # across Line, Speaker, Sequ, by taking the first value:
    across(c(Line, Speaker, Sequ), first),
    # collapse same-speaker `Utterance`s:
    Utterance = str_c(Utterance, collapse = ' '),     
    # collapse same-speaker `c7` data:
    c7 = str_c(c7, collapse = ' '),
    # sum same-speaker `N_c7` values:
    N_c7 = sum(N_c7, na.rm = TRUE)
           ) %>%
  # deactivate grouping:
  ungroup() %>%
  select(-grp)
# A tibble: 5 × 6
   Line Speaker  Sequ Utterance                                                c7                                                             N_c7
  <int> <chr>   <int> <chr>                                                    <chr>                                                         <int>
1     1 ID01.A      1 >like I don't understand< sorry like how old's your mom¿ like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_…    12
2     2 NA          1 (0.855)                                                  NA                                                                0
3     3 ID01.C      1 eh six:ty:::-one=                                        eh_UH sixty-one_MC                                                2
4     4 ID01.C      2 [when was] that¿=                                        when_RRQ was_VBDZ that_DD1                                        3
5     5 ID01.A      2 [yes] (0.163) =!this! was on °Wednesday°                 yes_UH NA this_DD1 was_VBDZ on_II Wednesday_NPD1                  5

The conversion of NA to "NA", however, is suboptimal, as all NA values in the column are converted to character, including those I need to keep as true NA. How can same-speaker c7 be collapsed without converting NA to "NA"?

Upvotes: 1

Views: 156

Answers (2)

Donald Seinen

Reputation: 4419

I read the mention of true NA as: drop the NAs where a speaker has text, but keep them where a speaker does not. To that end, na.omit can be used, with empty strings converted back to NA afterwards.

df %>%
  group_by(grp = rleid(Speaker, Sequ)) %>%
  summarise(
    across(c(Line, Speaker, Sequ), first),
    Utterance = str_c(Utterance, collapse = ' '),     
    c7 = na_if(str_c(na.omit(c7), collapse = ' '), ""),
    N_c7 = sum(N_c7, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  select(-grp)

   Line Speaker  Sequ Utterance                                c7                                     N_c7
  <int> <chr>   <int> <chr>                                    <chr>                                 <int>
1     1 ID01.A      1 how old's your mom¿                      how_RGQ old_JJ 's_VBZ your_APPGE mom~    12
2     2 NA          1 (0.855)                                  NA                                        0
3     3 ID01.C      1 eh six:ty:::-one=                        eh_UH sixty-one_MC                        2
4     4 ID01.C      2 [when was] that¿=                        when_RRQ was_VBDZ that_DD1                3
5     5 ID01.A      2 [yes] (0.163) =!this! was on °Wednesday° yes_UH this_DD1 was_VBDZ on_II Wedne~     5
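
The same idea can be sketched as a small base R helper, in case it helps to see the two cases side by side. The name collapse_keep_na is hypothetical (it is not part of dplyr or stringr), and it uses paste instead of str_c so that it runs without extra packages:

```r
# Hypothetical helper (not from any package): collapse the non-missing
# values of a character vector, but return a true NA when every value
# in the group is missing.
collapse_keep_na <- function(x, sep = " ") {
  x <- x[!is.na(x)]                 # drop NAs, as na.omit() would
  if (length(x) == 0) {
    NA_character_                   # nothing left: keep a true NA
  } else {
    paste(x, collapse = sep)        # otherwise join what remains
  }
}

mixed    <- collapse_keep_na(c("yes_UH", NA, "this_DD1"))  # group with text and an NA
all_miss <- collapse_keep_na(NA_character_)                # group that is NA only
```

Inside summarise this would be used as c7 = collapse_keep_na(c7), which should behave like the na_if(str_c(na.omit(c7), ...), "") line above.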

Upvotes: 2

Sinh Nguyen

Reputation: 4497

I think the issue is due to the way str_c is implemented: it intentionally returns NA if there is any NA in the input. If you replace str_c with paste, you get the same result without needing to convert NA to character first.
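
A minimal sketch of that difference on a toy vector (the vector x is made up for illustration, loosely based on the c7 values in the question):

```r
library(stringr)

x <- c("yes_UH", NA, "this_DD1")

# str_c() propagates missing values: a single NA makes the collapsed result NA
res_str_c <- str_c(x, collapse = " ")

# paste() coerces NA to the literal string "NA" before joining
res_paste <- paste(x, collapse = " ")

# a group that contains only NA becomes the string "NA", not a missing value
res_only_na <- paste(NA_character_, collapse = " ")
```

Note that paste does the NA-to-"NA" conversion implicitly, so a group consisting only of NA ends up as the string "NA" rather than a true NA.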

Your code:

question_output <- df %>%
  mutate(
    # convert true NA to character "NA":
    c7 = ifelse(is.na(c7), "NA", c7)
  ) %>%
  # group:
  group_by(grp = rleid(Speaker, Sequ)) %>%
  # summarise:
  summarise(
    # across Line, Speaker, Sequ, by taking the first value:
    across(c(Line, Speaker, Sequ), first),
    # collapse same-speaker `Utterance`s:
    Utterance = str_c(Utterance, collapse = ' '),     
    # collapse same-speaker `c7` data:
    c7 = str_c(c7, collapse = ' '),
    # sum same-speaker `N_c7` values:
    N_c7 = sum(N_c7, na.rm = TRUE)
  ) %>%
  # deactivate grouping:
  ungroup() %>%
  select(-grp)

My attempt with paste:

trial_output <- df %>%
  group_by(grp = rleid(Speaker, Sequ)) %>%
  # summarise:
  summarise(
    # across Line, Speaker, Sequ, by taking the first value:
    across(c(Line, Speaker, Sequ), first),
    # collapse same-speaker `Utterance`s:
    Utterance = paste(Utterance, collapse = ' '),     
    # collapse same-speaker `c7` data:
    c7 = paste(c7, collapse = ' '),
    # sum same-speaker `N_c7` values:
    N_c7 = sum(N_c7, na.rm = TRUE)
  ) %>%
  # deactivate grouping:
  ungroup() %>%
  select(-grp)

The result is identical:

identical(question_output, trial_output)
#> [1] TRUE

Created on 2022-01-11 by the reprex package (v2.0.1)

Upvotes: 1
