Replace lowercase in names, not in surnames

I have a database of people's names, and I want to abbreviate the given names but not the surnames. The surname is separated from the given name by a comma, and the different people are separated from each other by a semicolon, as in this example:

Michael, Jordan; Bird, Larry;

If the name is a single word, the code would be like this:

breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")

Result with this code:

Michael, J.; Bird, L.;

The problem is with compound given names. With this code, the name:

Jordan, Michael Larry;

becomes:

Jordan, Michael L.;

Could someone tell me how to remove all the lowercase letters between the comma and the semicolon, so that the result looks like this?

Jordan, M.L.;

Upvotes: 4

Views: 91

Answers (4)

mt1022

Reputation: 17299

Here is another solution:

x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'

gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"

gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"

Surnames are followed by a comma, while the parts of the given name are followed by a space or a semicolon. Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
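To see what the lookahead contributes, it may help to compare the pattern with and without it. This is only a small sketch; x2 is redefined so the snippet runs on its own, and the variable names no_lookahead and with_lookahead are made up for illustration.

```r
x2 <- 'Jordan, Michael Larry;'

# Without the lookahead, every capitalised word is abbreviated, surname included.
no_lookahead <- gsub('([A-Z])[a-z]+', '\\1.', x2)
no_lookahead
# [1] "J., M. L.;"

# With the lookahead, only words followed by a space or ';' are abbreviated,
# so the surname (which is followed by ',') is left alone.
with_lookahead <- gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
with_lookahead
# [1] "Jordan, M. L.;"
```

One caveat: a surname that itself contains a space (for example "De Niro") would have its first word abbreviated too, because that word is also followed by a space.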

To remove the space between M. and L., an additional step is needed:

gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
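Since the question works on a whole column rather than a single string, the two steps can be wrapped in a small function. This is a minimal sketch in base R; the function name abbreviate_given_names and the vector authors are hypothetical, not from the question.

```r
# Hypothetical helper: the name abbreviate_given_names is only for illustration.
abbreviate_given_names <- function(x) {
  # Step 1: shorten each given name (a word followed by a space or ';')
  # to its initial plus a period.
  initials <- gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x, perl = TRUE)
  # Step 2: drop the space left between consecutive initials.
  gsub('\\. ', '.', initials)
}

authors <- c('Michael, Jordan; Bird, Larry;', 'Jordan, Michael Larry;')
abbreviate_given_names(authors)
# [1] "Michael, J.; Bird, L.;" "Jordan, M.L.;"
```

Because gsub is vectorised, the same call should work on the column from the question, along the lines of breve$autor <- abbreviate_given_names(breve$autor).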

Upvotes: 1

d.b

Reputation: 32548

Here's one that uses gsub twice. The inner call handles names with no middle name and the outer call handles names with exactly one middle name.

x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"
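The two patterns cover one or two given names only. As a quick check, here is a sketch with a made-up input that has three given names (x3 is not from the question), compared against the lookahead pattern from the answer above.

```r
# Made-up input with three given names, for illustration only.
x3 <- "Bush, George Herbert Walker;"

# Neither of the two patterns matches three given names,
# so the string comes back unchanged.
two_pass <- gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;",
                 gsub(", ([A-Z])[a-z]+;", ", \\1.;", x3))
two_pass
# [1] "Bush, George Herbert Walker;"

# The lookahead pattern abbreviates any number of given names.
one_pass <- gsub('\\. ', '.',
                 gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x3, perl = TRUE))
one_pass
# [1] "Bush, G.H.W.;"
```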

Upvotes: 0

Benjamin Ye

Reputation: 518

There is probably a better way to do this, but I managed to get it to work using the stringr, tibble and dplyr packages.

library(stringr)
library(tibble)
library(dplyr)  # provides pull()
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
# split into people, then into surname (V1) and given names (V2)
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
# replace each run of lowercase letters in the given names with a period
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))

This code generates the tibble df, which has the following contents:

# A tibble: 4 x 2
  V1     V2   
  <chr>  <chr>
1 Jordan M.   
2 Bird   L.   
3 Obama  B.   
4 Bush   G. W.

The names are first split into surnames and given names and stored in a data frame so that the gsub() call does not operate on the surnames. Then, gsub() searches for each run of lowercase letters in the given names and replaces it with a single period.

Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
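The same split, abbreviate and rejoin idea can also be sketched in base R without a data frame. This is only an illustration under the same input format; the function name abbreviate_split is hypothetical.

```r
# Hypothetical helper: split people on ';', split each person on ',',
# abbreviate only the given-name part, then glue everything back together.
abbreviate_split <- function(x) {
  people <- strsplit(x, ";\\s*")[[1]]
  people <- people[nzchar(people)]      # drop empty pieces, if any
  parts <- strsplit(people, ",\\s*")    # each element: c(surname, given names)
  out <- vapply(parts, function(p) {
    # replace each run of lowercase letters in the given names with a period
    paste0(p[1], ", ", gsub("[a-z]+", ".", p[2]))
  }, character(1))
  paste(out, collapse = "; ")
}

names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
abbreviate_split(names)
# [1] "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W."
```

Like the tibble version, this assumes every person has exactly one comma separating the surname from the given names.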

...also, did you mean Michael Jordan, not Jordan Michael? lol

Upvotes: 0

teofil

Reputation: 2394

There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.

library(stringr)
library(dplyr)
library(tidyr)

ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"

c(ballers, mj) %>% 
#split the players
  str_split(., ";", simplify = TRUE) %>% 
# remove white space
  str_trim() %>% 
#transpose to get players in a column
  t %>% 
#split again into last name and first + middle (if any)
  str_split(",", simplify = TRUE) %>% 
# convert to a tibble
  as_tibble() %>% 
# remove more white space
  mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
  filter(!V1 == "") %>% 
# name the columns
  rename("Last"=V1, "First_two"=V2) %>% 
# separate the given names into first and middle (if any)
  separate(First_two, into=c("First", "Middle"), sep=" ") %>% 
# abbreviate to first letter
  mutate(First_i=abbreviate(First, 1)) %>% 
# abbreviate, but take into account that middle name might be missing
  mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>% 
# combine the first and middle initials
  mutate(Initials=paste(First_i, Middle_i, sep=".")) %>% 
# make the desired Last, F.M. vector
  mutate(Final=paste(Last, Initials, sep=", "))

# A tibble: 3 x 7
  Last    First   Middle First_i Middle_i Initials Final       
  <chr>   <chr>   <chr>  <chr>   <chr>    <chr>    <chr>       
1 Michael Jordan  NA     J       ""       J.       Michael, J. 
2 Jordan  Michael Larry  M       L.       M.L.     Jordan, M.L.
3 Bird    Larry   NA     L       ""       L.       Bird, L.    
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].

Much longer than a regex.

Upvotes: 0
