Reputation: 47310
I want to change this:
input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")
into this :
output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")
The only words that should be modified are those that are all upper case.
Edit:
An edge case, where the first letter of the sequence isn't in [A-Z]:
input <- "Philippe Fabre d'ÉGLANTINE"
Upvotes: 2
Views: 1573
Reputation: 556
You could also use the snakecase package and explicitly set sep_in = " " so that non-alphanumerics like ' are not deleted (the default is sep_in = "[^[:alnum:]]"):
library(snakecase)
input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")
output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")
to_title_case(input, sep_in = " ")
#> [1] "Théodore Agrippa d'Aubigné" "Vital d'Audiguier De La Menor"
identical(to_title_case(input, sep_in = " "), output)
#> [1] TRUE
Created on 2019-08-01 by the reprex package (v0.3.0)
This works because of the sep_in setting. snakecase::to_title_case() first applies snakecase::to_sentence_case(), which separates words by " ", and afterwards wraps the (lower case) result inside tools::toTitleCase(), which doesn't capitalize lone-standing "d"s, i.e. " d ' aubigné" becomes " d ' Aubigné". (For numeric characters the behaviour can be adjusted via the numerals argument.)
Upvotes: 0
Reputation: 33488
Here is an alternative solution:
gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", input, perl = TRUE)
I'm not trying to compete with the existing answers; I solved (or tried to solve) this for the challenge and am sharing it here because it might be useful to someone and/or I might get constructive feedback on how it could be improved.
Edit
I had for some reason skipped over:
only words [...] that are all upper case
I think the following deals a bit better with that:
gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", input, perl = TRUE)
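To see what the edit changes, here is a small sketch comparing the two patterns. The sample sentence and the variable names are made up for illustration; only the two gsub() calls come from this answer. The sample is deliberately ASCII-only, since how \b treats accented letters may depend on the encoding.

```r
# Hypothetical sample input (not from the question): it contains one
# mixed-case word and one all-upper-case word.
x <- "Ronald McDONALD and JOHN Smith"

# First pattern: lowercases everything after the first letter of every word,
# so mixed-case words are altered too.
first <- gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", x, perl = TRUE)

# Second pattern: only touches words made up entirely of upper case letters.
second <- gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", x, perl = TRUE)

print(first)   # "Ronald Mcdonald and John Smith"
print(second)  # "Ronald McDONALD and John Smith"
```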
Upvotes: 3
Reputation: 47310
A general answer that detects all upper case characters and works whatever the encoding would be:
input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR", "Philippe Fabre d'ÉGLANTINE")
gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", input, perl = TRUE)
# [1] "Théodore Agrippa d'Aubigné" "Vital d'Audiguier De La Menor" "Philippe Fabre d'Églantine"
Credits go to @Wiktor-Stribiżew.
\p{Lu} detects any Unicode upper case character; the second one can be replaced by \w to allow underscores and numbers (this would give the same output here).
(*UCP) is not necessary to reproduce the result here but will come in handy if the encoding of the input string differs from the native encoding. It makes the pattern "Unicode-aware", in Wiktor's words.
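As a sketch, the call can be wrapped in a small helper to make it reusable. The function name lower_all_caps and the second test string are hypothetical and only there for illustration.

```r
# Hypothetical wrapper around the gsub() call from this answer.
lower_all_caps <- function(x) {
  # \\1 keeps the first upper case letter, \\L\\2 lowercases the rest.
  # A word only matches if it consists of two or more upper case letters.
  gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", x, perl = TRUE)
}

# The edge case from the question
lower_all_caps("Philippe Fabre d'ÉGLANTINE")
# [1] "Philippe Fabre d'Églantine"

# Mixed-case words and single capitals are left alone
lower_all_caps("Ronald McDonald, PART A")
# [1] "Ronald McDonald, Part A"
```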
Upvotes: 3
Reputation: 43169
Form two groups with word boundaries on both sides, as in \b([A-Z])(\w+)\b, and use tolower on the second group (leaving the first untouched).
See a demo on regex101.com (and mind the modifiers, especially u).
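In R this could look like the sketch below. It is an adaptation rather than part of the answer: the \L in the replacement stands in for tolower, and (*UCP) is prepended so that \b and \w treat accented letters as word characters. The variable name pattern is just for illustration.

```r
# Sketch: \\1 keeps the first letter, \\L\\2 lowercases the rest of the word.
# (*UCP) makes \\b and \\w treat accented letters as word characters.
pattern <- "(*UCP)\\b([A-Z])(\\w+)\\b"

gsub(pattern, "\\1\\L\\2", "Vital d'AUDIGUIER DE LA MENOR", perl = TRUE)
# [1] "Vital d'Audiguier De La Menor"

# [A-Z] is ASCII-only, so a word starting with an accented capital
# (the edge case from the question) is not matched and stays as it is.
gsub(pattern, "\\1\\L\\2", "Philippe Fabre d'ÉGLANTINE", perl = TRUE)
# [1] "Philippe Fabre d'ÉGLANTINE"
```

Note that (\w+) also matches lower case letters, so this variant lowercases the tail of any word starting with A-Z, including mixed-case ones: "McDONALD" would become "Mcdonald".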
Upvotes: 1