Remove consecutive uppercase words from string

Question

I have a long string from which I would like to remove consecutive uppercase words (2+ in a row) and, if punctuation follows the last uppercase word, that as well. At the same time, I would like to keep single uppercase words and uppercase words that are part of a "mixed" word (see reprex).

I'm struggling to match the group of consecutive words in the reprex below.

string <- "Lorem ipsum DOLOR SIT AMET? consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, SAGITTIS ID? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam IN LOREM SIT amet leo accumsan"

#remove all consecutive UPPERCASE words including punctuation (--> DOLOR SIT AMET?), but not single uppercase words (--> NEC) or "mixed" words with uppercase and digits (--> ETIAM-123)
#this doesn't work:
string %>% 
  stringr::str_remove_all("\\b[:upper:]+\\b")
#> [1] "Lorem ipsum   ? consectetuer adipiscing elit. Morbi gravida libero  velit. Morbi scelerisque luctus velit. -123 dui sem, fermentum vitae,  ? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam    amet leo accumsan"

Created on 2020-05-30 by the reprex package (v0.3.0)

Any hints are appreciated :)

Wiktor Stribiżew · Accepted Answer

You may use

string <- "Lorem ipsum DOLOR SIT AMET? consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, SAGITTIS ID? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam IN LOREM SIT amet leo accumsan"
gsub("\\s*\\b\\p{Lu}{2,}(?:\\s+\\p{Lu}{2,})+\\b[\\p{P}\\p{S}]*", "", string, perl=TRUE)

Output:

[1] "Lorem ipsum consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam amet leo accumsan"

See the R demo and the regex demo.

Details

\s* - 0 or more whitespace characters
\b - a word boundary
\p{Lu}{2,} - two or more uppercase letters
(?:\s+\p{Lu}{2,})+ - 1 or more occurrences of 1+ whitespace characters followed by 2 or more uppercase letters
\b - a word boundary
[\p{P}\p{S}]* - 0 or more punctuation or symbol characters
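
To see how the parts listed above interact, here is a minimal sketch that runs the same pattern on three short fragments of the example string. The helper name `remove_upper_runs` is hypothetical and only used for illustration; the pattern is the one from the answer, with each backslash doubled as R string literals require.

```r
# Pattern from the answer; backslashes are doubled inside an R string literal.
pattern <- "\\s*\\b\\p{Lu}{2,}(?:\\s+\\p{Lu}{2,})+\\b[\\p{P}\\p{S}]*"

# Hypothetical helper (not part of the answer): drop runs of 2+ uppercase words,
# together with leading whitespace and any trailing punctuation or symbols.
remove_upper_runs <- function(x) {
  gsub(pattern, "", x, perl = TRUE)
}

# A run of three uppercase words followed by "?" is removed as a whole.
remove_upper_runs("ipsum DOLOR SIT AMET? consectetuer")
#> [1] "ipsum consectetuer"

# A single uppercase word does not satisfy the repeated group, so it stays.
remove_upper_runs("libero NEC velit.")
#> [1] "libero NEC velit."

# "ETIAM" is followed by "-", not by whitespace plus another uppercase word.
remove_upper_runs("velit. ETIAM-123 dui")
#> [1] "velit. ETIAM-123 dui"
```

Note that `{2,}` requires every word in the run to have at least two letters, so a sequence of single capital letters such as "A B C" would be left untouched. The same pattern should presumably also work with `stringr::str_remove_all()`, since the ICU engine supports these constructs, but that variant is not shown or tested here.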
