Gabriele B

Reputation: 2685

Merge (collapse) some elements of a vector of strings but not all in R

I have an unusual problem that I can't solve. I have a vector of strings in which each element represents one sentence of a novel.

What I need to do is collapse ONLY those lines that belong to the same dialogue. For example, take these lines:

snap <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
      "He didn't seem at all upset at being almost knocked to the ground.",
      "On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today!",
      "Rejoice, for You-Know-Who has gone at last!",
      "Even Muggles like yourself should be celebrating, this happy, happy day!\"",
      "And the old man hugged Mr Dursley around the middle and walked off."
      )

Lines 3 to 5 belong to the same dialogue, so they must be collapsed, giving this vector:

snap.2 <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
      "He didn't seem at all upset at being almost knocked to the ground.",
      "On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!\"",
      "And the old man hugged Mr Dursley around the middle and walked off."
      )

I can detect unbalanced double quotes with:

library(stringr)
which((str_count(snap, "\"") %% 2) != 0)
[1] 3 5

But then I have no idea how to merge lines 3, 4 and 5 as in the example above.

Any idea on how to do that?

Upvotes: 1

Views: 210

Answers (2)

Gabriele B

Reputation: 2685

It's probably not the best way to do it (the code is very ugly), but it works. Basically:

  1. Split the output of which into pairs (each pair holds the start and end offsets of one dialogue)
  2. Find the offsets of the previous and next dialogue using lead and lag from dplyr
  3. Fill the gaps with dummy pairs where dialogue.start = dialogue.end for non-dialogue lines
  4. Use the resulting data frame as an index for pasting

In code:

library(stringr)
library(dplyr)

# Lines with an odd number of double quotes open or close a dialogue
dialogue.start <- which((str_count(snap, "\"") %% 2) != 0)

quotes.fill <- data.frame(dialogue.start) %>%
  mutate(n = row_number())

# Odd rows are dialogue openings: attach the closing line, the start of the
# next dialogue and the start of the previous one
quotes.fill$dialogue.end <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 1), NA)
# Default is NROW(snap) + 1L so the lines after the last dialogue are kept
quotes.fill$dialogue.next <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 2, default = NROW(snap) + 1L), NA)
quotes.fill$dialogue.before <- ifelse((quotes.fill$n %% 2) != 0, lag(quotes.fill$dialogue.start, 2, default = 0), NA)

# Keep one row per dialogue
quotes.fill <- quotes.fill %>% filter(!is.na(dialogue.end)) %>%
  select(-n)

quotes.gaps <- do.call(rbind, lapply(split(quotes.fill, seq(nrow(quotes.fill))), function(x) {

  prologue <- NULL

  # Non-dialogue lines after this dialogue and before the first one
  dialogue.hold <- seq(to = (x$dialogue.next - 1), from = (x$dialogue.end + 1))
  dialogue.prologue <- seq(to = (x$dialogue.start - 1), from = (x$dialogue.before + 1))

  # Only the first dialogue has a prologue, and only if it does not start on line 1
  if (x$dialogue.before == 0 && x$dialogue.start > 1) prologue <- data.frame(dialogue.start = dialogue.prologue, dialogue.end = dialogue.prologue, stringsAsFactors = FALSE)

  # No lines between this dialogue and the next one
  if ((x$dialogue.end + 1) >= x$dialogue.next) return(rbind(prologue, x[, c("dialogue.start", "dialogue.end")]))

  return(rbind(prologue, x[, c("dialogue.start", "dialogue.end")], data.frame(dialogue.start = dialogue.hold, dialogue.end = dialogue.hold, stringsAsFactors = FALSE)))
})
)

# Paste each start:end range into one string; unname() drops the split() names
snap.2 <- unname(do.call(c, lapply(split(quotes.gaps, seq(nrow(quotes.gaps))), function(rng, novel) {
  paste(novel[rng$dialogue.start:rng$dialogue.end], collapse = " ")
}, novel = snap)))
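
The same pairing idea can probably be written more compactly in base R. The following is only a sketch, not the code above: `merge_dialogue` is a hypothetical helper name, and it assumes that straight double quotes are the only dialogue markers and that every opened dialogue is closed on a later line.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# merge every run of lines that opens and closes on a line with an odd
# number of double quotes.
merge_dialogue <- function(lines) {
  # Count double quotes per line with base R only
  n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
  odd <- n_quotes %% 2 == 1
  # TRUE when an odd number of unbalanced lines precede this line,
  # i.e. the line continues a dialogue that is still open
  open_before <- (cumsum(odd) - odd) %% 2 == 1
  # Start a new group on every line that does not continue an open dialogue
  group <- cumsum(!open_before)
  vapply(split(lines, group), paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

# Made-up input for illustration
example <- c("a", "b \"x", "y", "z\"", "c")
merged <- merge_dialogue(example)
print(merged)
```

Applied to `snap` from the question, the groups should come out as 1, 2, 3, 3, 3, 4, so lines 3 to 5 are pasted together and the other lines are left alone.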

Upvotes: 1

akrun

Reputation: 887118

We could paste all the lines together and then split the result with a regex:

out <- strsplit(paste(snap, collapse=' '), '(?<=\\.)\\s*|(?<=["])\\s', perl = TRUE)[[1]]
identical(out, snap.2)
#[1] TRUE

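NOTE: The question does not make the general pattern clear. This regex splits after a period, or after a double quote that is followed by whitespace, which fits the example.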
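
To make that caveat concrete, here is a small sketch with made-up sentences (they are not from the novel, and the variable names are only illustrative). It reuses the same regex on narrative text that ends in "!" or contains an abbreviation with a period.

```r
# Same regex as above, stored once for reuse
pattern <- '(?<=\\.)\\s*|(?<=["])\\s'

# Made-up narrative lines: nothing in the regex splits after "!"
narr <- c("He ran!", "She left.")
glued <- strsplit(paste(narr, collapse = " "), pattern, perl = TRUE)[[1]]
length(glued)  # the two sentences are not separated again

# Made-up line with an abbreviation: the period after "Mr" triggers a split
abbr <- strsplit("Mr. Dursley left.", pattern, perl = TRUE)[[1]]
print(abbr)
```

So the approach fits text where non-dialogue sentences end with a period and abbreviations carry no period, as in the example, but other texts may need a different pattern.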

Upvotes: 1
