Reputation: 21442
I have speech data in which consecutive utterances by the same speaker should be collapsed:
df <- structure(list(Line = 1:7,
Speaker = c("ID01.A", NA, "ID01.C", "ID01.C", "ID01.A", "ID01.A", "ID01.A"),
Utterance = c("how old's your mom¿",
"(0.855)",
"eh six:ty:::-one=",
"[when was] that¿=",
"[yes]", # collapse with ...
"(0.163)", # ... this and ...
"=!this! was on °Wednesday°"), # ... that
Sequ = c(1L, 1L,1L, 2L, 2L, 2L, 2L),
c7 = c("how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
NA,
"eh_UH sixty-one_MC",
"when_RRQ was_VBDZ that_DD1",
"yes_UH", # collapse with ...
NA, # ... and ...
"this_DD1 was_VBDZ on_II Wednesday_NPD1"), # ... that
N_c7 = c(12L,NA, 2L, 3L, 1L, NA, 4L)),
row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
I'm doing well as far as the summarizing/collapsing of Utterance and N_c7 is concerned. Only column c7 poses a problem, namely the presence of a (true) NA in Line 6. This NA prevents the summarize operation unless I first convert true NA to character "NA":
library(dplyr)
library(stringr)     # for str_c()
library(data.table)  # for rleid()
df %>%
mutate(
# convert true NA to character "NA":
c7 = ifelse(is.na(c7), "NA", c7)
) %>%
# group:
group_by(grp = rleid(Speaker, Sequ)) %>%
# summarise:
summarise(
# across Line, Speaker, Sequ, by taking the first value:
across(c(Line, Speaker, Sequ), first),
# collapse same-speaker `Utterance`s:
Utterance = str_c(Utterance, collapse = ' '),
# collapse same-speaker `c7` data:
c7 = str_c(c7, collapse = ' '),
# sum same-speaker `N_c7` values:
N_c7 = sum(N_c7, na.rm = TRUE)
) %>%
# deactivate grouping:
ungroup() %>%
select(-grp)
# A tibble: 5 × 6
   Line Speaker  Sequ Utterance                                c7                                                N_c7
  <int> <chr>   <int> <chr>                                    <chr>                                            <int>
1     1 ID01.A      1 how old's your mom¿                      how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1            12
2     2 NA          1 (0.855)                                  NA                                                   0
3     3 ID01.C      1 eh six:ty:::-one=                        eh_UH sixty-one_MC                                   2
4     4 ID01.C      2 [when was] that¿=                        when_RRQ was_VBDZ that_DD1                           3
5     5 ID01.A      2 [yes] (0.163) =!this! was on °Wednesday° yes_UH NA this_DD1 was_VBDZ on_II Wednesday_NPD1     5
Converting NA to "NA" is suboptimal, however, because all NA values in the column are turned into character strings, including those I need to keep as true NA. How can same-speaker c7 values be collapsed without converting NA to "NA"?
Upvotes: 1
Views: 156
Reputation: 4419
I read the mention of true NA as: drop the NA values where a speaker has text, but keep them where a speaker has none. To this end, na.omit can be used, with the resulting empty strings converted back to NA afterwards.
df %>%
group_by(grp = rleid(Speaker, Sequ)) %>%
summarise(
across(c(Line, Speaker, Sequ), first),
Utterance = str_c(Utterance, collapse = ' '),
c7 = na_if(str_c(na.omit(c7), collapse = ' '), ""),
N_c7 = sum(N_c7, na.rm = TRUE)
) %>%
ungroup() %>%
select(-grp)
   Line Speaker  Sequ Utterance                                c7                                     N_c7
  <int> <chr>   <int> <chr>                                    <chr>                                 <int>
1     1 ID01.A      1 how old's your mom¿                      how_RGQ old_JJ 's_VBZ your_APPGE mom~    12
2     2 NA          1 (0.855)                                  NA                                        0
3     3 ID01.C      1 eh six:ty:::-one=                        eh_UH sixty-one_MC                        2
4     4 ID01.C      2 [when was] that¿=                        when_RRQ was_VBDZ that_DD1                3
5     5 ID01.A      2 [yes] (0.163) =!this! was on °Wednesday° yes_UH this_DD1 was_VBDZ on_II Wedne~     5
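The per-group logic can also be isolated in a small helper. This is only a minimal sketch of the same idea: the name collapse_keep_na is hypothetical, and it uses base R (paste and an explicit length check) in place of str_c and na_if so that it needs no packages.

```r
# Hypothetical helper (not from the answer above): collapse the non-NA
# elements of a character vector and return a true NA if none are left.
collapse_keep_na <- function(x, sep = " ") {
  kept <- x[!is.na(x)]          # drop NA values inside the group
  if (length(kept) == 0) {
    NA_character_               # group had no text at all: keep a true NA
  } else {
    paste(kept, collapse = sep) # otherwise join the remaining strings
  }
}

# A group with text and an NA in between: the NA is dropped.
mixed <- collapse_keep_na(c("yes_UH", NA, "this_DD1 was_VBDZ"))

# A group consisting only of NA: the result stays a true NA.
empty <- collapse_keep_na(NA_character_)
```

Inside summarise, this would be called as c7 = collapse_keep_na(c7), assuming the same grouping as above.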
Upvotes: 2
Reputation: 4497
I think the issue is due to the way str_c is implemented: it intentionally returns NA if there is any NA in the input. If you replace str_c with paste, which turns NA into the string "NA" while concatenating, you get the same result without having to convert NA to character first.
question_output <- df %>%
mutate(
# convert true NA to character "NA":
c7 = ifelse(is.na(c7), "NA", c7)
) %>%
# group:
group_by(grp = rleid(Speaker, Sequ)) %>%
# summarise:
summarise(
# across Line, Speaker, Sequ, by taking the first value:
across(c(Line, Speaker, Sequ), first),
# collapse same-speaker `Utterance`s:
Utterance = str_c(Utterance, collapse = ' '),
# collapse same-speaker `c7` data:
c7 = str_c(c7, collapse = ' '),
# sum same-speaker `N_c7` values:
N_c7 = sum(N_c7, na.rm = TRUE)
) %>%
# deactivate grouping:
ungroup() %>%
select(-grp)
trial_output <- df %>%
group_by(grp = rleid(Speaker, Sequ)) %>%
# summarise:
summarise(
# across Line, Speaker, Sequ, by taking the first value:
across(c(Line, Speaker, Sequ), first),
# collapse same-speaker `Utterance`s:
Utterance = paste(Utterance, collapse = ' '),
# collapse same-speaker `c7` data:
c7 = paste(c7, collapse = ' '),
# sum same-speaker `N_c7` values:
N_c7 = sum(N_c7, na.rm = TRUE)
) %>%
# deactivate grouping:
ungroup() %>%
select(-grp)
The result is identical:
identical(question_output, trial_output)
#> [1] TRUE
Created on 2022-01-11 by the reprex package (v2.0.1)
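The difference between the two functions can be seen on a single vector. This is a minimal sketch that assumes the stringr package is installed; the vector x is made up for illustration and mirrors the last group of the example data.

```r
library(stringr)

# Illustrative vector with a true NA in the middle
x <- c("yes_UH", NA, "this_DD1")

# str_c() propagates the missing value: the whole result is NA
res_str_c <- str_c(x, collapse = " ")

# paste() coerces NA to the text "NA" and keeps concatenating
res_paste <- paste(x, collapse = " ")

is.na(res_str_c)  # TRUE
res_paste         # "yes_UH NA this_DD1"
```

Note that the paste result still contains the literal text "NA", just as in the question's own output.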
Upvotes: 1