Reputation: 37
I have a long string from which I would like to remove runs of consecutive uppercase words (2+ in a row), and if punctuation follows the last uppercase word, that as well. At the same time, I would like to keep single uppercase words and uppercase words that are part of a "mixed" word (see reprex).
I am struggling to implement the consecutive-word group in the reprex below.
string <- "Lorem ipsum DOLOR SIT AMET? consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, SAGITTIS ID? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam IN LOREM SIT amet leo accumsan"
#remove all consecutive UPPERCASE words including punctuation (--> DOLOR SIT AMET?), but not single uppercase words (--> NEC) or "mixed" words with uppercase and digits (--> ETIAM-123)
library(magrittr) # provides %>%
#this doesn't work: it removes every uppercase run, including single words
string %>%
  stringr::str_remove_all("\\b[:upper:]+\\b")
#> [1] "Lorem ipsum ? consectetuer adipiscing elit. Morbi gravida libero velit. Morbi scelerisque luctus velit. -123 dui sem, fermentum vitae, ? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam amet leo accumsan"
Created on 2020-05-30 by the reprex package (v0.3.0)
Any hints are appreciated :)
Upvotes: 1
Views: 171
Reputation: 627536
You may use
string <- "Lorem ipsum DOLOR SIT AMET? consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, SAGITTIS ID? malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam IN LOREM SIT amet leo accumsan"
gsub("\\s*\\b\\p{Lu}{2,}(?:\\s+\\p{Lu}{2,})+\\b[\\p{P}\\p{S}]*", "", string, perl=TRUE)
Output:
[1] "Lorem ipsum consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam amet leo accumsan"
See the R demo and the regex demo.
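To check the edge cases from the question in isolation, the same pattern can be wrapped in a small helper. This is only a sketch: the function name `remove_upper_runs` is made up for illustration, and the short test strings are not from the question.

```r
# Hypothetical helper (the name is illustrative, not part of any package).
# It applies the same pattern as the gsub() call above.
remove_upper_runs <- function(x) {
  pattern <- "\\s*\\b\\p{Lu}{2,}(?:\\s+\\p{Lu}{2,})+\\b[\\p{P}\\p{S}]*"
  gsub(pattern, "", x, perl = TRUE)
}

# A single uppercase word is kept
remove_upper_runs("keep NEC here")
#> [1] "keep NEC here"

# A "mixed" word is kept
remove_upper_runs("keep ETIAM-123 here")
#> [1] "keep ETIAM-123 here"

# A run of 2+ uppercase words is removed together with trailing punctuation
remove_upper_runs("drop DOLOR SIT AMET? now")
#> [1] "drop now"
```

Note that the leading `\\s*` consumes the whitespace before the run, so no double space is left behind.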
Details
\s* - 0 or more whitespaces
\b - word boundary
\p{Lu}{2,} - two or more capital letters
(?:\s+\p{Lu}{2,})+ - 1 or more occurrences of 1+ whitespaces followed by 2 or more uppercase letters
\b - a word boundary
[\p{P}\p{S}]* - any 0 or more symbols or punctuation
Upvotes: 5
Reputation: 174586
Perhaps this?
stringr::str_remove_all(string, "([[:upper:]]+ )+[[:upper:]]+( |[:punct:])*")
#> [1] "Lorem ipsum consectetuer adipiscing elit. Morbi gravida libero NEC velit. Morbi scelerisque luctus velit. ETIAM-123 dui sem, fermentum vitae, malesuada in, quam. Proin mattis lacinia justo. Vestibulum facilisis auctor urna. Aliquam amet leo accumsan"
Created on 2020-05-30 by the reprex package (v0.3.0)
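For comparison, here is a rough sketch of how this pattern behaves next to the word-boundary pattern from the other answer. It is an assumption-laden transliteration: both patterns are run through base R `gsub(perl = TRUE)` rather than stringr, `[:punct:]` is written as `[[:punct:]]`, and the variable names and test strings are illustrative. The stringr (ICU) version may differ in details.

```r
# Illustrative names; both patterns are run with base R and PCRE.
p_boundary <- "\\s*\\b\\p{Lu}{2,}(?:\\s+\\p{Lu}{2,})+\\b[\\p{P}\\p{S}]*"
p_posix    <- "([[:upper:]]+ )+[[:upper:]]+( |[[:punct:]])*"

x <- c("drop DOLOR SIT AMET? now", "I AM here")

res_boundary <- gsub(p_boundary, "", x, perl = TRUE)
res_posix    <- gsub(p_posix, "", x, perl = TRUE)

res_boundary
#> [1] "drop now"  "I AM here"
res_posix
#> [1] "drop now" "here"
```

The two agree on the question's kind of input, but this pattern also treats single-letter uppercase words such as "I" as part of a run, whereas the `{2,}` quantifier in the other pattern does not.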
Upvotes: 2