Reputation: 1634
I want to create a column containing only the first name of each person in the dataset. In this case, the column should contain John, David, Carey, and David, with NA values for rows that are not people or that have no first name. However, I am facing two difficulties.
The first is that I need to filter out the rows written entirely in capital letters, because they're not PEOPLE; they're ENTITIES.
The second is that I need to extract the word right after the comma, as that is the first name.
So I am wondering what the best approach is to get a new column with the first names of the people.
Reproducible dataset:
df <- structure(list(company_number = c("04200766", "04200766", "04200766",
"04200766", "04200766", "04200766"), directors = c("THOMAS, John Anthony",
"THOMAS, David Huw", "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
"THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
Upvotes: 0
Views: 52
Reputation: 388982
Using str_extract:
library(dplyr)
library(stringr)
df %>% mutate(people = str_extract(directors, '(?<=,\\s)\\w+'))
#   company_number                  directors people
#1:       04200766       THOMAS, John Anthony   John
#2:       04200766          THOMAS, David Huw  David
#3:       04200766 BRIGHTON SECRETARY LIMITED   <NA>
#4:       04200766     THOMAS, Carey Rosaline  Carey
#5:       04200766          THOMAS, David Huw  David
#6:       04200766  BRIGHTON DIRECTOR LIMITED   <NA>
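For readers without dplyr or stringr, the same lookbehind idea can be sketched in base R. This is a minimal sketch rather than part of the answer itself: the vector `directors` and the result name `first_name` are illustrative, and it assumes every person is written as `SURNAME, Given names` while entities contain no comma.

```r
# Illustrative input in the same shape as the question's `directors` column
directors <- c("THOMAS, John Anthony", "BRIGHTON SECRETARY LIMITED",
               "THOMAS, Carey Rosaline")

# perl = TRUE enables the lookbehind (?<=,\s): match the run of word
# characters that immediately follows "comma + whitespace"
m <- regexpr("(?<=,\\s)\\w+", directors, perl = TRUE)

# Start from all-NA, then fill in only the rows where a match was found
first_name <- rep(NA_character_, length(directors))
first_name[m != -1] <- regmatches(directors, m)

first_name
```

Rows with no comma never match, so they stay NA, which is what separates the entities from the people here.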
Upvotes: 1
Reputation: 4358
We can do this in two steps.
First, take the first word after the comma:
df$names <- sub(".*?, (.*?) .*","\\1",df$directors)
Then set any string that still has more than one word to NA:
df$names <- ifelse(sapply(strsplit(df$names, " "), length)>1,NA,df$names)
Output:
> df
  company_number                  directors names
1       04200766       THOMAS, John Anthony  John
2       04200766          THOMAS, David Huw David
3       04200766 BRIGHTON SECRETARY LIMITED  <NA>
4       04200766     THOMAS, Carey Rosaline Carey
5       04200766          THOMAS, David Huw David
6       04200766  BRIGHTON DIRECTOR LIMITED  <NA>
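One caveat with the two-step approach: a person with a single given name (for example a hypothetical `THOMAS, John`) has no space after the name, so the `sub()` pattern does not match and the row ends up as NA. A possible variant is sketched below, assuming the comma is what distinguishes people from entities; `directors` and `first_name` are illustrative names, not part of the original data.

```r
# Illustrative input, including a hypothetical single-given-name row
directors <- c("THOMAS, John Anthony", "THOMAS, John",
               "BRIGHTON SECRETARY LIMITED")

# People are assumed to be the rows that contain a comma
has_comma <- grepl(",", directors, fixed = TRUE)

# Capture the first run of non-space characters after the comma;
# rows without a comma become NA
first_name <- ifelse(has_comma,
                     sub("^[^,]*,\\s*(\\S+).*$", "\\1", directors),
                     NA_character_)

first_name
```

This does it in one pass and does not depend on counting words, so a one-word entity name would also come out as NA.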
Upvotes: 3