odm_mc

Reputation: 33

Keeping only fully capitalized words in string using R

I have a dataset containing a vector of full names, and I would like to drop the first names and keep only the last names. Both parts vary in the number of words, but the last name(s) are always fully uppercase and come before the first name(s), in which only the first letter is capitalized.

In other words, I have something like the following:

x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")

And would like to have:

x
[1] "AA AA" "BB BB" "CC" "DD"    

I have been trying to do this with the stringr package, but it only returns the first capital letter of the first word:

library(stringr)
str_extract(x, "[A-Z]")
[1] "A" "B" "C" "D" 

Upvotes: 3

Views: 5509

Answers (1)

akrun

Reputation: 886948

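We can use str_extract_all to extract every fully capitalized word. str_extract returns only the first match, and the pattern in the OP's post, [A-Z], matches a single capital letter. We need one or more capitals ([A-Z]+) delimited by word boundaries (\\b), so that the capital at the start of a first name is not matched. The output is a list with one character vector per element of x, which we can collapse into single strings by applying paste over it with sapply.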

library(stringr)
sapply(str_extract_all(x, "\\b[A-Z]+\\b"), paste, collapse= ' ')
#[1] "AA AA" "BB BB" "CC"    "DD"   
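For comparison, the same extract-then-collapse idea can be sketched in base R without stringr. This is only an illustration, not part of the original answer: the names matches and last_names are made up here, and perl = TRUE is an assumption chosen so that \\b and [A-Z] are handled by the PCRE engine.

```r
# Base R sketch of the same idea; no stringr dependency.
x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")

# gregexpr() locates every match of the pattern in each string;
# regmatches() pulls them out as a list, one character vector per element of x.
matches <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x, perl = TRUE))

# Collapse each vector of matches back into a single string.
# vapply() is used instead of sapply() so the result type is guaranteed.
last_names <- vapply(matches, paste, character(1), collapse = " ")
last_names
```

An element with no fully uppercase word would come back as an empty string, because paste with collapse on a zero-length vector returns "".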

Or using gsub to delete the mixed-case and lowercase words, then trimws to strip the leftover spaces

trimws(gsub("[[:alpha:]][a-z]+|[a-z][[:alpha:]]+", "", x))
#[1] "AA AA" "BB BB" "CC"    "DD"  
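To see why trimws is needed, the gsub approach can be split into two steps. This is a rough sketch with illustrative variable names (pattern, stripped, cleaned) that are not in the original answer. The first alternative of the pattern matches any letter followed by one or more lowercase letters (such as "Aa"), and the second matches a lowercase letter followed by one or more letters (such as "eE").

```r
x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")

# Same pattern as above, stored in a variable for readability.
pattern <- "[[:alpha:]][a-z]+|[a-z][[:alpha:]]+"

# Step 1: delete the matched words. The spaces that separated them
# are not part of the match, so they stay behind at the end.
stripped <- gsub(pattern, "", x)

# Each string is longer before trimming than after,
# which shows that whitespace was left over.
nchar(stripped) > nchar(trimws(stripped))

# Step 2: remove the leading and trailing whitespace.
cleaned <- trimws(stripped)
cleaned
```

Note that trimws only touches the ends of each string, so this relies on the uppercase words being contiguous, as they are in the question's data.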

Both approaches also work on another vector that includes words starting with a lowercase letter

x1 <- c(x, "eE ee EE")
sapply(str_extract_all(x1, "\\b[A-Z]+\\b"), paste, collapse= ' ')
#[1] "AA AA" "BB BB" "CC"    "DD"    "EE"   

trimws(gsub("[[:alpha:]][a-z]+|[a-z][[:alpha:]]+", "", x1))
#[1] "AA AA" "BB BB" "CC"    "DD"    "EE"   

Upvotes: 4
