codedancer

Reputation: 1634

Get a new column with only the first name in R

I want to create a column that contains only the first name of each person in the dataset. In this case, the column should hold the values John, David, Carey, and David, with NA for rows that are not people or have no first name. However, I am facing two difficulties.

The first is that I need to filter out all rows written entirely in capital letters, because they are not people; they are entities.

The second is that I need to extract the word right after the comma, as that is the first name.

What is the best approach to get a new column with the first name of each person?

Reproducible dataset:

df <- structure(list(company_number = c("04200766", "04200766", "04200766", 
"04200766", "04200766", "04200766"), directors = c("THOMAS, John Anthony", 
"THOMAS, David Huw", "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline", 
"THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

Upvotes: 0

Views: 52

Answers (2)

Ronak Shah

Reputation: 388982

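Using str_extract with a lookbehind, which returns the first word that follows a comma and a whitespace character, and NA when there is no such match: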

library(dplyr)
library(stringr)

df %>% mutate(people = str_extract(directors, '(?<=,\\s)\\w+'))

#   company_number                  directors people
#1:       04200766       THOMAS, John Anthony   John
#2:       04200766          THOMAS, David Huw  David
#3:       04200766 BRIGHTON SECRETARY LIMITED   <NA>
#4:       04200766     THOMAS, Carey Rosaline  Carey
#5:       04200766          THOMAS, David Huw  David
#6:       04200766  BRIGHTON DIRECTOR LIMITED   <NA>
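If you would rather not load extra packages, the same lookbehind can be used in base R. This is only a minimal sketch: it assumes the directors values from the question are available as a plain character vector (rebuilt here so the snippet runs on its own), and the variable names are made up for illustration.

directors <- c("THOMAS, John Anthony", "THOMAS, David Huw",
               "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
               "THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")

# perl = TRUE is needed for the lookbehind (?<=,\\s)
m <- regexpr("(?<=,\\s)\\w+", directors, perl = TRUE)

# start with NA everywhere, then fill in only the rows that matched
people <- rep(NA_character_, length(directors))
people[m > 0] <- regmatches(directors, m)

people

regexpr returns -1 for rows without a match, so the entity rows simply keep their NA.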

Upvotes: 1

Daniel O

Reputation: 4358

We can do this in two steps.

First, take the first word after the comma:

df$names <- sub(".*?, (.*?) .*","\\1",df$directors)

Then set any string that still contains more than one word (the rows where the pattern did not match) to NA:

df$names <- ifelse(sapply(strsplit(df$names, " "), length)>1,NA,df$names)

Output:

> df
  company_number                  directors names
1       04200766       THOMAS, John Anthony  John
2       04200766          THOMAS, David Huw David
3       04200766 BRIGHTON SECRETARY LIMITED  <NA>
4       04200766     THOMAS, Carey Rosaline Carey
5       04200766          THOMAS, David Huw David
6       04200766  BRIGHTON DIRECTOR LIMITED  <NA>
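One caveat: the pattern above expects a space after the first name, so a director listed with a single given name (for example a hypothetical "THOMAS, John", which is not in the sample data) would not match and would end up as NA in the second step. A possible variant, sketched here under the question's assumption that all-caps rows are entities, checks for entities explicitly and does not require a middle name:

# hypothetical input; the second value is not part of the original dataset
x <- c("THOMAS, John Anthony", "THOMAS, John", "BRIGHTON SECRETARY LIMITED")

# a row written entirely in capitals is treated as an entity
is_entity <- x == toupper(x)

# drop everything up to the comma, keep the first run of non-space characters
first_name <- ifelse(is_entity, NA_character_,
                     sub("^[^,]*, *([^ ]+).*$", "\\1", x))

first_name

The toupper comparison addresses the "filter out the capital-letter rows" part of the question directly, instead of relying on the word count.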

Upvotes: 3
