gmarais

Reputation: 1891

R get first letters of double/triple-barrel surnames in data.frame

I have a dataframe with 2 columns:

> df1
      Surname      Name
1 The Builder       Bob
2 Zeta-Jones Catherine

I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:

      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ

Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.

I have tried:

Step1:

> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The"     "Builder"

[[2]]
[1] "Zeta-Jones"

My research suggests I could possibly use strtrim as a Step 2, but so far I have only found a number of ways that do not work.

Upvotes: 2

Views: 101

Answers (2)

Jota

Reputation: 17611

You can target the space, hyphen, and beginning of the line with lookarounds. For instance, any character (.) that is not preceded by the beginning of the line, a space, or a hyphen can be replaced with "":

with(df1, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"

or

with(df1, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))

The second gsub replaces with an empty string ("") any character that is preceded by a character other than " " or "-". The first character of the string has nothing before it, so it is kept as well.
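
To get the result into the new column the question asks for, the same substitution can be assigned directly. This is a minimal sketch, assuming the data frame is named df1 as in the question; it is rebuilt here only so the snippet runs on its own:

```r
# Rebuild the example data from the question (assumed name: df1)
df1 <- data.frame(
  Surname = c("The Builder", "Zeta-Jones"),
  Name = c("Bob", "Catherine"),
  stringsAsFactors = FALSE
)

# Delete every character that is neither at the start of the string
# nor directly after a space or a hyphen
df1$Shortened_Surname <- gsub("(?<!^|[ -]).", "", df1$Surname, perl = TRUE)

df1$Shortened_Surname
# [1] "TB" "ZJ"
```

Note that this keeps the first letter of each word as it is written, so a lowercase particle such as "van" would contribute a lowercase letter.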

Upvotes: 4

Gopala

Reputation: 10483

You can try this, if the format of the names is as shown in the input data:

library(stringr)
# Extract every capital letter, then paste each row's letters together
df1$Shortened_Surname <- sapply(str_extract_all(df1$Surname, '[A-Z]'), paste, collapse = '')

Output is as follows:

      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ

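If the format of the names is somewhat inconsistent (for example, surnames that contain lowercase words or capitals in the middle of a word), you will need to modify the above pattern to capture that. You can use the | (alternation) operator inside the pattern to combine multiple patterns.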
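
As one illustration of such a modification, the sketch below takes the first letter of every word regardless of case and upper-cases the result. It is written in base R so it runs without extra packages. The helper name initials() and the "van der Berg" sample are hypothetical, not part of the question. The pattern treats any word boundary as the start of a new word, so an apostrophe also starts one, which may or may not be what you want.

```r
# Hypothetical helper: first letter of every word, upper-cased
initials <- function(x) {
  # \\b[[:alpha:]] matches a letter that sits right after a word boundary
  first_letters <- regmatches(x, gregexpr("\\b[[:alpha:]]", x, perl = TRUE))
  # Collapse each element's letters into one string, then upper-case
  toupper(vapply(first_letters, paste, character(1), collapse = ""))
}

initials(c("The Builder", "Zeta-Jones", "van der Berg"))
# [1] "TB"  "ZJ"  "VDB"
```

With the question's data this could be used as df1$Shortened_Surname <- initials(df1$Surname).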

Upvotes: 1
