Reputation: 21400
I have transcriptions of speech turns and the Part-of-Speech tags of the words used. Colloquial forms such as "gonna" and "wanna" are rendered in the transcriptions as whitespace-separated tokens, namely "gon na" and "wan na". Contracting the separated word forms by deleting/replacing the whitespace, both in the speech turns and in the tags, is not a problem. What is problematic is when a turn contains both the colloquial form (e.g., "gon na") and the standard form (e.g., "going to"), because the tags for either form are identical: VVG TO0 in the case of "gon na"/"going to" and VVB TO0 in the case of "wan na"/"want to".
So what I need to do is contract the tags only for the colloquial word forms but not for the equivalent standard forms.
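For reference, the unproblematic part (contracting the word forms in the speech turns themselves) could look like the minimal sketch below. The variable names turn and turn_contracted are only illustrative, and the input is one turn from the test data further down:

```r
# Illustrative input: one turn from the test data
turn <- "we 're not gon na know the person who 's going to listen ."

# Join "gon na" / "wan na" into "gonna" / "wanna";
# the standard form "going to" does not match and is left untouched
turn_contracted <- gsub("\\b(gon|wan) na\\b", "\\1na", turn, perl = TRUE)
```

This works on the text because the colloquial and standard forms are spelled differently; the tag strings offer no such difference, which is where the difficulty starts.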
Test data:
The speech turns are in column Turn, the Part-of-Speech tags in column c5:
df_test <- data.frame(
Turn = c("we 're not gon na know the person who 's going to listen .",
"right . do you wan na go shopping ? yes ? do you want to go shopping with me ?",
"do you just wan na walk ?",
"it 's gon na rain ."),
c5 = c("PNP VBB XX0 VVG TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI",
"AV0 VDB PNP VVB TO0 VVI VVG ITJ VDB PNP VVI TO0 VVI VVG PRP PNP",
"VDB PNP AV0 VVB TO0 VVI",
"PNP VBZ VVG TO0 VVI"), stringsAsFactors = FALSE
)
What I've done so far:
# define replacements:
tag_replacements <- setNames(c("VVB=TO0", "VVG=TO0"), # new forms
c("VVB TO0", "VVG TO0")) # old forms
# define pattern:
forms <- c("wan na", "gon na")
forms_pattern <- paste0("\\b(", paste0(forms, collapse = "|"), ")\\b")
# create new c5 column:
library(stringr)
df_test$c5_new <- ifelse(grepl(forms_pattern, df_test$Turn),
                         # pass the full-length vector so rows stay aligned with the test
                         str_replace_all(df_test$c5, tag_replacements),
                         df_test$c5)
Result so far:
df_test$c5_new
[1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG=TO0 VVI"
[2] "AV0 VDB PNP VVB=TO0 VVI VVG ITJ VDB PNP VVB=TO0 VVI VVG PRP PNP"
[3] "VDB PNP AV0 VVB=TO0 VVI"
[4] "PNP VBZ VVG=TO0 VVI"
The expected result, however, is this (where the second occurrence of VVG and TO0 in [1] and the second occurrence of VVB and TO0 in [2] are kept separate):
[1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI"
[2] "AV0 VDB PNP VVB=TO0 VVI VVG ITJ VDB PNP VVB TO0 VVI VVG PRP PNP"
[3] "VDB PNP AV0 VVB=TO0 VVI"
[4] "PNP VBZ VVG=TO0 VVI"
I'd be grateful for advice on how to solve this issue. My hunch is that the positions in Turn and c5 must play a role, so the function str_locate_all comes to mind, but I don't really know how to operationalize this.
Upvotes: 0
Views: 61
Reputation: 4487
This is not exactly a regex solution, but I spent some time figuring it out. I wonder if this solution suits your needs:
library(stringr)
df_test <- data.frame(
Turn = c("we 're not gon na know the person who 's going to listen .",
"right . do you wan na go shopping ? yes ? do you want to go shopping with me ?",
"do you just wan na walk ?",
"it 's gon na rain ."),
c5 = c("PNP VBB XX0 VVG TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI",
"AV0 VDB PNP VVB TO0 VVI VVG ITJ VDB PNP VVI TO0 VVI VVG PRP PNP",
"VDB PNP AV0 VVB TO0 VVI",
"PNP VBZ VVG TO0 VVI"), stringsAsFactors = FALSE
)
# Align the tokens of a turn with its tags and mark the tag of each colloquial form
replace_pattern <- function(x, y) {
  # split the turn into tokens and drop single-character punctuation, which has no tag
  x_1 <- str_split(x, " ")[[1]]
  x_1 <- x_1[!grepl("^[\\W]$", x_1, perl = TRUE)]
  # split the tag string so that y_1[i] is the tag of x_1[i]
  y_1 <- str_split(y, " ")[[1]]
  replacement_list <- list(
    list(first_word = "gon", second_word = "na", replacement = "VVG="),
    list(first_word = "wan", second_word = "na", replacement = "VVB="))
  for (item in replacement_list) {
    # positions where first_word is directly followed by second_word
    next_word <- c(x_1[-1], "")
    index_number <- which(x_1 == item[["first_word"]] & next_word == item[["second_word"]])
    y_1[index_number] <- item[["replacement"]]
  }
  # close the gap left by the marker: "VVG= TO0" becomes "VVG=TO0"
  gsub("= TO0", "=TO0", paste(y_1, collapse = " "), fixed = TRUE)
}
# apply row by row; seq_len() also behaves correctly for a data frame with zero rows
for (row in seq_len(nrow(df_test))) {
  df_test[["c5"]][row] <- replace_pattern(x = df_test[["Turn"]][row],
                                          y = df_test[["c5"]][row])
}
Output:
[1] "PNP VBB XX0 VVG=TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI"
[2] "AV0 VDB PNP VVB=TO0 VVI VVG ITJ VDB PNP VVI TO0 VVI VVG PRP PNP"
[3] "VDB PNP AV0 VVB=TO0 VVI"
[4] "PNP VBZ VVG=TO0 VVI"
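The same position-based idea (the hunch in the question that the positions in Turn and c5 must play a role) can also be sketched without hard-coding the tags. This is only a rough variant under two assumptions: punctuation tokens carry no tag, and after dropping them the tokens and tags line up one to one. The helper name contract_tags and its arguments are hypothetical, not part of any package:

```r
# Hypothetical helper: merge the tags of "gon na" / "wan na" by token position
contract_tags <- function(turn, tags,
                          first_words = c("gon", "wan"), second_word = "na") {
  tokens <- strsplit(turn, " ", fixed = TRUE)[[1]]
  # Assumption: pure punctuation tokens have no tag, so drop them
  tokens <- tokens[!grepl("^[[:punct:]]+$", tokens)]
  tag_vec <- strsplit(tags, " ", fixed = TRUE)[[1]]
  # Assumption: tokens and tags are now aligned one to one
  stopifnot(length(tokens) == length(tag_vec))
  # positions of a first word that is directly followed by the second word
  i <- which(tokens %in% first_words & c(tokens[-1], "") == second_word)
  if (length(i) > 0) {
    # join the two tags at those positions and drop the now redundant second tag
    tag_vec[i] <- paste0(tag_vec[i], "=", tag_vec[i + 1])
    tag_vec <- tag_vec[-(i + 1)]
  }
  paste(tag_vec, collapse = " ")
}

# Test data as used in this answer
turns <- c("we 're not gon na know the person who 's going to listen .",
           "right . do you wan na go shopping ? yes ? do you want to go shopping with me ?",
           "do you just wan na walk ?",
           "it 's gon na rain .")
tags <- c("PNP VBB XX0 VVG TO0 VVI AT0 NN1 PNQ VBZ VVG TO0 VVI",
          "AV0 VDB PNP VVB TO0 VVI VVG ITJ VDB PNP VVI TO0 VVI VVG PRP PNP",
          "VDB PNP AV0 VVB TO0 VVI",
          "PNP VBZ VVG TO0 VVI")

# apply the helper pairwise to turns and tags
c5_new <- mapply(contract_tags, turns, tags, USE.NAMES = FALSE)
```

On this test data, c5_new should contain the same four strings as the output above. Because the merged tag is built from whatever tags sit at the matched positions, further colloquial forms can be covered by extending first_words instead of adding new replacement strings.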
Upvotes: 1