moodymudskipper

Reputation: 47310

Convert upper case words to title case

I want to change this:

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")

into this:

output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")

The only words that should be modified are those that are all upper case.

Edit:

An edge case, where the first letter of the sequence isn't in [A-Z]:

input <- "Philippe Fabre d'ÉGLANTINE"

Upvotes: 2

Views: 1573

Answers (4)

Taz

Reputation: 556

You could also use the snakecase package and set sep_in = " " so that non-alphanumeric characters such as ' are not deleted (the default is sep_in = "[^[:alnum:]]"):

library(snakecase)

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")
output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")

to_title_case(input, sep_in = " ")
#> [1] "Théodore Agrippa d'Aubigné"    "Vital d'Audiguier De La Menor"

identical(to_title_case(input, sep_in = " "), output)
#> [1] TRUE

Created on 2019-08-01 by the reprex package (v0.3.0)

This works because:

  1. snakecase treats special characters as words when they are not specified as input separators via sep_in.
  2. snakecase::to_title_case() first applies snakecase::to_sentence_case(), which separates words by " ", and then wraps the (lower case) result in tools::toTitleCase(), which doesn't capitalize a stand-alone "d", i.e. " d ' aubigné" becomes " d ' Aubigné".
  3. snakecase always "protects" its output, i.e. it cleans up the messy and probably unintended output separators (here " ") around non-alphanumeric characters (here '). (For numeric characters, the behaviour can be adjusted via the numerals argument.)

Upvotes: 0

s_baldur

Reputation: 33488

Here is an alternative solution:

gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", input, perl = TRUE)

I'm not trying to compete with the existing answers. I took this on as a challenge and am sharing it here because it might be useful to someone, and I'd welcome constructive feedback on how it could be improved.

Edit

I had for some reason skipped over:

only words [...] that are all upper case

I think the following deals a bit better with that:

gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", input, perl = TRUE)
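To see the difference between the two patterns, here is a small sketch on ASCII-only input. The vector x and the mixed-case name in it are made up for illustration. The first pattern lower-cases everything after the first letter of any word, while the edited one leaves mixed-case words alone.

```r
# Illustrative input (not from the question): one all-caps surname, one mixed-case surname
x <- c("Vital d'AUDIGUIER DE LA MENOR", "Ronald McDONALD")

# First attempt: any run of letters preceded by a letter is lower-cased
first_try <- gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", x, perl = TRUE)
first_try
# [1] "Vital d'Audiguier De La Menor" "Ronald Mcdonald"

# Edited pattern: only upper case letters that follow a word-initial upper case letter
second_try <- gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", x, perl = TRUE)
second_try
# [1] "Vital d'Audiguier De La Menor" "Ronald McDONALD"
```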

Upvotes: 3

moodymudskipper

Reputation: 47310

A general answer that detects all upper case characters and works whatever the encoding would be:

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR", "Philippe Fabre d'ÉGLANTINE")
gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", input, perl = TRUE)
# [1] "Théodore Agrippa d'Aubigné"    "Vital d'Audiguier De La Menor" "Philippe Fabre d'Églantine"

Credit goes to @Wiktor-Stribiżew.

\p{Lu} matches any Unicode upper case letter. The second \p{Lu} can be replaced by \w to also allow underscores and digits (this would give the same output here).

(*UCP) is not necessary to reproduce the result here, but it will come in handy if the encoding of the input string differs from the native encoding. It makes the pattern "Unicode-aware", in Wiktor's words.
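For reuse, the call above can be wrapped in a small helper. This is only a sketch: the function name upper_words_to_title is made up for illustration, and the example sticks to ASCII input so that the result does not depend on locale or encoding.

```r
# Hypothetical helper (the name is not from any package): wraps the gsub() call above.
upper_words_to_title <- function(x) {
  # (*UCP)     makes \b (and \w) Unicode-aware
  # (\p{Lu})   first upper case letter of the word, kept as is
  # (\p{Lu}+)  remaining upper case letters, lower-cased by \L in the replacement
  gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", x, perl = TRUE)
}

# Illustrative input: the mixed-case "McDONALD" is not all upper case, so it stays as is
ascii_input <- c("Vital d'AUDIGUIER DE LA MENOR", "Ronald McDONALD")
ascii_output <- upper_words_to_title(ascii_input)
ascii_output
# [1] "Vital d'Audiguier De La Menor" "Ronald McDONALD"
```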

Upvotes: 3

Jan

Reputation: 43169

Form two groups with word boundaries on both sides, as in

\b([A-Z])(\w+)\b

and use tolower on the second group (leaving the first untouched).
See a demo on regex101.com (and mind the modifiers, especially u).
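
In R, this idea might look like the sketch below. It assumes gsub() with perl = TRUE, where \\L in the replacement lower-cases the group that follows; the variable names and the second input string are made up for illustration. Two caveats: [A-Z] only covers ASCII capitals, so the É edge case from the question is not handled, and because \w+ also accepts lower case letters, a mixed-case word such as McDONALD gets rewritten too.

```r
# Sketch of the two-group approach with base R's gsub() and perl = TRUE.
# Group 1: first (ASCII) capital; group 2: rest of the word
two_group_pattern <- "\\b([A-Z])(\\w+)\\b"

ascii_names <- c("Vital d'AUDIGUIER DE LA MENOR", "Ronald McDONALD")

# \\1 keeps the first letter untouched; \\L\\2 lower-cases the remainder
two_group_result <- gsub(two_group_pattern, "\\1\\L\\2", ascii_names, perl = TRUE)
two_group_result
# [1] "Vital d'Audiguier De La Menor" "Ronald Mcdonald"
```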


As a side note: you still have a couple of questions with (not yet accepted) answers.

Upvotes: 1
