Reputation: 33
I have a dataset containing a vector of first and last names. I would like to remove the first names and keep only the last names. While both the last names and first names vary in the number of words, the last name(s) are always in uppercase and are before the first names, while only the first letter of the first name(s) is capitalized.
In other words, I have something like the following:
x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")
And would like to have:
x
[1] "AA AA" "BB BB" "CC" "DD"
I have been trying to do this with the stringr package, but it only returns the first capital letter of the first word:
library(stringr)
str_extract(x, "[A-Z]")
[1] "A" "B" "C" "D"
Upvotes: 3
Views: 5509
Reputation: 886948
We can use str_extract_all to extract all the capitalized substrings. The pattern used in the OP's post can only match a single capital letter. We need one or more ([A-Z]+) along with the word boundary (\\b). The output will be a list, which we can paste together by looping with sapply.
library(stringr)
sapply(str_extract_all(x, "\\b[A-Z]+\\b"), paste, collapse= ' ')
#[1] "AA AA" "BB BB" "CC" "DD"
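If you would rather not depend on stringr, the same two steps (extract every uppercase word, then collapse each group) can be sketched in base R. This is only an illustration under the assumptions in the question; the names caps and last_names are made up for the example, and perl = TRUE is my choice so that [A-Z] and \\b behave independently of the locale.

```r
x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")

# gregexpr() locates every all-uppercase word; regmatches() pulls them out,
# giving one character vector per input string (a list, like str_extract_all)
caps <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x, perl = TRUE))

# collapse each list element back into a single string
last_names <- vapply(caps, paste, character(1), collapse = " ")
last_names
#[1] "AA AA" "BB BB" "CC"    "DD"
```

vapply is used instead of sapply only so the result is guaranteed to be a character vector, even for an empty input.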
Or using gsub to delete the mixed-case and lowercase words, with trimws to strip the leftover spaces
trimws(gsub("[[:alpha:]][a-z]+|[a-z][[:alpha:]]+", "", x))
#[1] "AA AA" "BB BB" "CC" "DD"
Testing both on another vector that adds an element starting with lowercase words
x1 <- c(x, "eE ee EE")
sapply(str_extract_all(x1, "\\b[A-Z]+\\b"), paste, collapse= ' ')
#[1] "AA AA" "BB BB" "CC" "DD" "EE"
trimws(gsub("[[:alpha:]][a-z]+|[a-z][[:alpha:]]+", "", x1))
#[1] "AA AA" "BB BB" "CC" "DD" "EE"
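Since the question says the last names always come first, another option is to cut the string at the first word that looks like a first name. This is a rough sketch, not a general solution: it assumes every first name starts with one uppercase letter followed by a lowercase letter, so it would leave the extra "eE ee EE" element untouched, which is why it is only run on the original x here. The name last_only is made up for the example.

```r
x <- c("AA AA Aa Aa", "BB BB Bb", "CC Cc Cc", "DD Dd")

# drop everything from the first "Capitalized" word (plus the space before it)
# to the end of the string; the all-uppercase words in front are kept
last_only <- sub("\\s+[A-Z][a-z].*$", "", x, perl = TRUE)
last_only
#[1] "AA AA" "BB BB" "CC"    "DD"
```

This does a single substitution per string and needs no pasting afterwards, at the cost of relying on the ordering described in the question.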
Upvotes: 4