Reputation: 2685
I have an unusual problem that I can't solve. I have a vector of strings in which each element represents a sentence of a novel.
What I need to do is collapse ONLY those lines that are inside the same dialogue. For example, take these lines:
snap <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
"He didn't seem at all upset at being almost knocked to the ground.",
"On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today!",
"Rejoice, for You-Know-Who has gone at last!",
"Even Muggles like yourself should be celebrating, this happy, happy day!\"",
"And the old man hugged Mr Dursley around the middle and walked off."
)
Lines 3 to 5 belong to the same dialogue, so they must be collapsed, with the resulting vector being:
snap.2 <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
"He didn't seem at all upset at being almost knocked to the ground.",
"On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!\"",
"And the old man hugged Mr Dursley around the middle and walked off."
)
I can detect the lines with unbalanced double quotes with (str_count comes from stringr):
which((str_count(snap, "\"") %% 2) != 0)
[1] 3 5
But then I have no idea how to merge lines 3, 4 and 5 as in the example above.
Any ideas on how to do that?
Upvotes: 1
Views: 210
Reputation: 2685
It's probably not the best way to do it (very ugly code), but it works. Basically: find the lines with unbalanced quotes using which, then group them in pairs (each pair represents the start and end offsets of a dialogue). In code:
library(stringr)  # str_count
library(dplyr)    # %>%, mutate, row_number, lead, lag, filter, select

# Lines with an odd number of double quotes open or close a dialogue
dialogue.start <- which((str_count(snap, "\"") %% 2) != 0)
# Pair them up: odd rows are dialogue starts, the row after each is its end
quotes.fill <- data.frame(dialogue.start) %>%
  mutate(n = row_number())
quotes.fill$dialogue.end <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 1), NA)
# Start of the next dialogue; one past the last line if there is none,
# so that the lines after the final dialogue are not dropped
quotes.fill$dialogue.next <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 2, default = NROW(snap) + 1L), NA)
quotes.fill$dialogue.before <- ifelse((quotes.fill$n %% 2) != 0, lag(quotes.fill$dialogue.start, 2, default = 0), NA)
quotes.fill <- quotes.fill %>% filter(!is.na(dialogue.end)) %>%
  select(-n)
# Add single-line ranges for the lines before, between and after the dialogues
quotes.gaps <- do.call(rbind, lapply(split(quotes.fill, seq(nrow(quotes.fill))), function(x) {
  prologue <- NULL
  dialogue.hold <- seq(to = (x$dialogue.next - 1), from = (x$dialogue.end + 1))
  dialogue.prologue <- seq(to = (x$dialogue.start - 1), from = (x$dialogue.before + 1))
  # Only the first dialogue has a prologue, and only if it does not start on line 1
  if (x$dialogue.before == 0 && x$dialogue.start > 1) prologue <- data.frame(dialogue.start = dialogue.prologue, dialogue.end = dialogue.prologue, stringsAsFactors = FALSE)
  if ((x$dialogue.end + 1) >= x$dialogue.next) return(rbind(prologue, x[, c("dialogue.start", "dialogue.end")]))
  return(rbind(prologue, x[, c("dialogue.start", "dialogue.end")], data.frame(dialogue.start = dialogue.hold, dialogue.end = dialogue.hold, stringsAsFactors = FALSE)))
}))
# Paste each range of lines into a single element
snap.2 <- do.call(c, lapply(split(quotes.gaps, seq(nrow(quotes.gaps))), function(c, novel) {
  paste(novel[c$dialogue.start:c$dialogue.end], collapse = " ")
}, novel = snap))
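The same pairing idea can probably be written more compactly with a running count of quotes. This is only a sketch in base R, not the code above: collapse_dialogue is a hypothetical helper name, and it assumes that dialogue is delimited by plain double quotes, that quotes are never nested, and that every opened quote is eventually closed.

```r
# Hypothetical helper (not part of the answer above): merge the lines that
# fall inside the same double-quoted dialogue.
collapse_dialogue <- function(lines) {
  # Number of double quotes on each line (base R, no stringr needed)
  n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
  # Quotes seen before each line; an odd count means a dialogue is still open
  open_before <- (cumsum(n_quotes) - n_quotes) %% 2 == 1
  # A line starts a new group unless it continues an open dialogue
  group <- cumsum(!open_before)
  # Paste the lines of each group into one element
  unname(vapply(split(lines, group), paste, character(1), collapse = " "))
}

# Made-up input for illustration only
example <- c("A.", "B said: \"x!", "y!", "z!\"", "C.")
merged <- collapse_dialogue(example)
```

Here merged should have three elements, with the second, third and fourth input lines joined by single spaces. A line that both closes one dialogue and opens another keeps the two dialogues in the same element, which may or may not be what you want.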
Upvotes: 1
Reputation: 887118
We could paste them together and then split based on a regex
out <- strsplit(paste(snap, collapse=' '), '(?<=\\.)\\s*|(?<=["])\\s', perl = TRUE)[[1]]
identical(out, snap.2)
#[1] TRUE
NOTE: The general pattern is not clear from the example, so the regex may need adjusting for other text.
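To illustrate that caveat, here is a small sketch with made-up input (the lines vector below is hypothetical, not taken from the question). It applies the same regex to a dialogue whose first sentence ends with a full stop instead of an exclamation mark.

```r
# Hypothetical input: the first sentence of the dialogue ends with a full stop
lines <- c("He said: \"Wait.", "I am coming!\"", "Then he left.")

out <- strsplit(paste(lines, collapse = " "),
                '(?<=\\.)\\s*|(?<=["])\\s', perl = TRUE)[[1]]

# The regex splits after every full stop, including the one after "Wait.",
# so the dialogue is not kept together as a single element
print(out)
```

So the regex approach relies on sentences inside a dialogue not ending in a full stop, which happens to hold for the example in the question.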
Upvotes: 1