Contract strings in one column based on strings in another column

Question

I have transcriptions of speech turns and the Part-of-Speech tags of the words used. Colloquial forms such as "gonna" and "wanna" are rendered in the transcriptions as whitespace-separated tokens, namely "gon na" and "wan na". Contracting the separated word forms by deleting/replacing the whitespace, both in the speech turns and in the tags, is not a problem. What is problematic is when a turn contains both the colloquial form (e.g., "gon na") and the standard form (e.g., "going to"), because the tags for either form are identical: VVG TO0 in the case of "gon na"/"going to", and VVB TO0 in the case of "wan na"/"want to". So what I need to do is contract the tags only for the colloquial word forms but not for the equivalent standard forms.

Test data:

The speech turns are in column Turn, the Part-of-Speech tags in column c5:

df_test <- data.frame(
  Turn = c("we 're not gon na know the person who 's going to listen .",
           "right . do you wan na go shopping ? yes ? do you want to go shopping with me ?",
           "do you just wan na walk ?",
           "it 's gon na rain ."),
  c5 = c("PNP VBB XX0 VVG TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI",
         "AV0 VDB PNP VVB TO0 VVI VVG ITJ VDB PNP VVB TO0 VVI VVG PRP PNP",
         "VDB PNP AV0 VVB TO0 VVI",
         "PNP VBZ VVG TO0 VVI"), stringsAsFactors = FALSE
)

What I've done so far:

# define replacements:
tag_replacements <- setNames(c("VVB=TO0", "VVG=TO0"),   # new forms
                             c("VVB TO0", "VVG TO0"))   # old forms

# define pattern:
forms <- c("wan na", "gon na")
forms_pattern <- paste0("\\b(", paste0(forms, collapse = "|"), ")\\b")   # "\\b" = word boundary; a single "\b" is a backspace in R strings

# create new c5 column:
library(stringr)
# replace on the full column so that ifelse() gets vectors of equal length:
df_test$c5_new <- ifelse(grepl(forms_pattern, df_test$Turn),
                         str_replace_all(df_test$c5, tag_replacements),
                         df_test$c5)

Result so far:

df_test$c5_new
[1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG=TO0 VVI"             
[2] "AV0 VDB PNP VVB=TO0 VVI VVG ITJ VDB PNP VVB=TO0 VVI VVG PRP PNP"
[3] "VDB PNP AV0 VVB=TO0 VVI"                                         
[4] "PNP VBZ VVG=TO0 VVI"

The expected result, however, is this (where the second occurrence of VVG and TO0 in [1] and the second occurrence of VVB and TO0 in [2] are kept separate):

[1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI"             
[2] "AV0 VDB PNP VVB=TO0 VVI VVG ITJ VDB PNP VVB TO0 VVI VVG PRP PNP"
[3] "VDB PNP AV0 VVB=TO0 VVI"                                         
[4] "PNP VBZ VVG=TO0 VVI"

I'd be grateful for advice on how to solve this issue (my hunch is that the position in Turn and c5 must play a role, so the function str_locate_all comes to mind, but I don't really know how to operationalize this).
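
To make the positional idea concrete, here is a minimal sketch in base R rather than with str_locate_all. It relies on one assumption that holds for the test data but may not hold elsewhere: punctuation tokens in Turn carry no tag in c5, so once they are dropped, the n-th word lines up with the n-th tag. The function name contract_tags and its arguments are hypothetical, chosen for illustration only.

```r
# Sketch only: contract the tags of "gon na" / "wan na" by token position.
# Assumption: pure-punctuation tokens in `turn` have no tag in `tags`.
contract_tags <- function(turn, tags, first = c("gon", "wan"), second = "na") {
  words <- strsplit(turn, " +")[[1]]
  words <- words[!grepl("^[[:punct:]]+$", words)]  # drop ".", "?", etc.
  tag   <- strsplit(tags, " +")[[1]]
  n     <- length(words)

  # If words and tags do not line up, leave the tags untouched.
  if (n != length(tag) || n < 2) return(tags)

  # Positions where a colloquial first part is directly followed by "na".
  i <- which(words[-n] %in% first & words[-1] == second)
  if (length(i) == 0) return(tags)

  tag[i] <- paste(tag[i], tag[i + 1], sep = "=")  # e.g. "VVG=TO0"
  paste(tag[-(i + 1)], collapse = " ")            # drop the merged-in tag
}

contract_tags("we 're not gon na know the person who 's going to listen .",
              "PNP VBB XX0 VVG TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI")
# [1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI"
```

Applied row-wise, for example with mapply(contract_tags, df_test$Turn, df_test$c5, USE.NAMES = FALSE), this should contract only the tags that sit at the positions of "gon na" and "wan na". Rows where the word and tag counts differ are returned unchanged, so such rows would need to be inspected separately.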
