user2627717

Reputation: 344

Extract last word in string before the first comma

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to split into first name, last name, and titles. In this data, the titles, if there are any, always start after the first comma.

I am trying to sort the list into:

FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A

Thanks in advance.

Here is my sample code:

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )

You can see that with this code, Mr. Owens is not behaving: his title starts after the last comma rather than the first, and his last name begins from the P. As you can tell, I referred to "Extract last word in string in R", "Extract 2nd to last word in string", and "Extract last word in a string after comma if there are multiple words else the first word".

Upvotes: 0

Views: 1067

Answers (2)

Pierre L

Reputation: 28461

You were off to a good start, so you can pick up from there. The firstnames variable works as written. For lastnames, I used a modified name list: the inner sub() call strips everything from the first comma onward, so the last name is then the final word of what remains. For titles, there is a two-step process: first remove everything up to and including the first comma, then replace entries that had no comma (and were therefore left unchanged) with a hyphen, -.

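namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
# First name: the first word of each entry
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
# Last name: drop everything from the first comma on, then keep the final word
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
# Titles: drop everything up to the first comma, plus the whitespace after it
titles <- sub("^[^,]*,\\s*", "", namelist)
# Entries without a comma are unchanged by sub(), so mark them as having no title
titles <- ifelse(titles == namelist, "-", titles)

names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III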
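
To see why the titles pattern from the question picked up only the last title, here is a minimal sketch that assumes the same namelist as above. The variable names after_last and after_first are made up for this illustration. The greedy ".*," runs all the way to the last comma, whereas a pattern built on "[^,]*" cannot cross a comma and so stops at the first one.

```r
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")

# Greedy: ".*," consumes everything up to the LAST comma
after_last <- sub(".*,\\s*", "", namelist)

# Anchored: "[^,]*" cannot cross a comma, so the match ends at the FIRST comma
after_first <- sub("^[^,]*,\\s*", "", namelist)

after_last[1]   # "M.P.H."
after_first[1]  # "M.D., M.P.H."
```

Entries without any comma, such as "Lara T. Kraft", are returned unchanged by both patterns, which is what the comparison against namelist relies on.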

Upvotes: 1

Maksim Gayduk

Reputation: 1082

This should do the trick, at least on test data:

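# Split each entry on commas: the first piece is the name, the rest are titles
x <- strsplit(namelist, split = ",")
# Drop the single leading space left behind after each comma
x <- rapply(x, function(s) gsub("^ ", "", s), how = "replace")

names <- sapply(x, function(y) y[[1]])
# Rejoin the pieces after the first comma; entries without a title get ""
titles <- sapply(x, function(y) if (length(y) > 1) {
    paste(y[-1], collapse = ",")
} else {
    ""
})
names <- strsplit(names, split = " ")
firstnames <- sapply(names, function(y) y[[1]])
# Take the last word rather than the third, so a missing middle initial does not cause an error
lastnames <- sapply(names, function(y) y[[length(y)]])

names <- data.frame(firstnames, lastnames, titles)
names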

In cases like this, where the strings always have the same structure, it is easier to use functions like strsplit() to extract the desired parts.
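
As a variation on the same idea, the splitting can be wrapped in a small helper. This is only a sketch under stated assumptions: parse_names is a hypothetical name (not from any package), every entry is assumed to contain at least one word before the first comma, and entries without a title are marked with "-" as in the question.

```r
# Hypothetical helper: split "First M. Last, Title, Title" style strings
parse_names <- function(x) {
  # Position of the first comma in each string, -1 where there is none
  comma <- as.integer(regexpr(",", x, fixed = TRUE))
  has_title <- comma > 0

  # Name is everything before the first comma; titles are everything after it
  name_part  <- ifelse(has_title, substr(x, 1, comma - 1), x)
  title_part <- ifelse(has_title, trimws(substr(x, comma + 1, nchar(x))), "-")

  # Assumes each name part has at least one word
  words <- strsplit(trimws(name_part), "\\s+")
  data.frame(
    firstnames = vapply(words, function(w) w[1], character(1)),
    lastnames  = vapply(words, function(w) w[length(w)], character(1)),
    titles     = title_part,
    stringsAsFactors = FALSE
  )
}

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")
parsed <- parse_names(namelist)
parsed
```

Using the first comma position instead of splitting on every comma keeps multi-part titles such as "M.D., M.P.H." intact without having to paste the pieces back together.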

Upvotes: 1
