Reputation: 21440
The data I have is a vector with sentences cut into pieces.
y <- c("G'day", "world and everybody", "else.", "How's life?", "Hope", "you're", "doing just", "fine.")
I'd like to put the sentences back together.
Expected result:
y
[1] "G'day world and everybody else."
[2] "How's life?"
[3] "Hope you're doing just fine."
The 'rule' is that every sentence starts with an upper-case letter. Building on this rule, here is what I've tried so far (the result is far from satisfactory):
unlist(strsplit(paste0(y[which(grepl("^[A-Z]", y))], " ", y[which(grepl("^[a-z]", y))], collapse = ","), ","))
[1] "G'day world and everybody" "How's life? else." "Hope you're" "G'day doing just"
[5] "How's life? fine."
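One way to read the rule above: every piece that starts with an upper-case letter opens a new sentence, and each following piece belongs to it. Below is a minimal base-R sketch under that reading. The names `starts`, `group`, and `sentences` are illustrative only, and it assumes the first piece begins a sentence and that no piece inside a sentence starts with a capital.

```r
y <- c("G'day", "world and everybody", "else.", "How's life?",
       "Hope", "you're", "doing just", "fine.")

# TRUE for every piece that opens a new sentence (starts with an upper-case letter)
starts <- grepl("^[A-Z]", y)

# Running count of sentence starts gives each piece its sentence number
group <- cumsum(starts)

# Paste the pieces of each group back together with single spaces
sentences <- unname(vapply(split(y, group), paste, character(1), collapse = " "))
sentences
```

Because the pieces are grouped by position rather than matched up by recycling, the original order of the sentences is kept.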
EDIT:
I have come up with this solution, which recovers the sentences (though in a different order and without the trailing periods) but looks ugly:
y1 <- c(paste0(y[grepl("^[A-Z].*[^.?]$", y, perl = TRUE)], " ", unlist(strsplit(paste0(y[which(grepl("^[a-z]", y))], collapse = " "), "\\."))), y[grepl("^[A-Z].*[.?]$", y, perl = TRUE)])
y1
[1] "G'day world and everybody else" "Hope you're doing just fine" "How's life?"
What better solution is there?
EDIT 2:
Another good solution is this:
library(stringr)
str_extract_all(paste(y, collapse = " "), "[A-Z][^.?]*(\\.|\\?)")
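Note that `str_extract_all` returns a list with one element per input string, so the sentences sit in its first element. For comparison, here is a base-R sketch of the same extraction with `regmatches` and `gregexpr`. It assumes, like the pattern above, that every sentence starts with a capital letter and ends in `.` or `?`; the names `joined`, `pattern`, and `sentences` are illustrative.

```r
y <- c("G'day", "world and everybody", "else.", "How's life?",
       "Hope", "you're", "doing just", "fine.")

# Glue all pieces into one string
joined <- paste(y, collapse = " ")

# A capital letter, then anything that is not a sentence terminator, then the terminator
pattern <- "[A-Z][^.?]*[.?]"

# gregexpr() finds every match; regmatches() extracts them (list of length 1 here)
sentences <- regmatches(joined, gregexpr(pattern, joined, perl = TRUE))[[1]]
sentences
```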
Upvotes: 1
Views: 42
Reputation: 174586
I would use gsub to insert a newline before each capital letter, then split at the newlines:
unlist(strsplit(gsub(" ([A-Z])", "\n\\1", paste(y, collapse = " ")), "\n"))
#> [1] "G'day world and everybody else." "How's life?"
#> [3] "Hope you're doing just fine."
Created on 2020-05-28 by the reprex package (v0.3.0)
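As a variation on the same idea, the intermediate newline can be skipped by splitting directly on any space that is followed by a capital letter, using a Perl-style lookahead. This is only a sketch with illustrative names (`joined`, `sentences`). Like the `gsub` version, it would also split before a capitalised word in the middle of a sentence, such as a name.

```r
y <- c("G'day", "world and everybody", "else.", "How's life?",
       "Hope", "you're", "doing just", "fine.")

# Glue all pieces into one string
joined <- paste(y, collapse = " ")

# Split on a space only when the next character is an upper-case letter;
# the lookahead (?=[A-Z]) keeps that letter in the following sentence
sentences <- strsplit(joined, " (?=[A-Z])", perl = TRUE)[[1]]
sentences
```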
Upvotes: 2