Reputation: 344
I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to split into first name, last name, and titles. In this data, the titles (if there are any) always start after the first comma.
I am trying to split the list into:
FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
As you can see, this code fails for Mark Owens: his title is taken from after the last comma instead of the first, and his last name comes out as "P". I based the code on these questions: "Extract last word in string in R", "Extract 2nd to last word in string", and "Extract last word in a string after comma if there are multiple words else the first word".
Upvotes: 0
Views: 1067
Reputation: 28461
You were off to a good start, so you can pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list: the inner sub call removes everything from the first comma onward, so the last name is then the final word in the string. For titles there is a two-step process: first remove everything up to and including the first comma, then replace the entries that did not change (those with no comma) with a hyphen (-).
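namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
# Drop everything from the first comma onward, then keep the final word
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
# Remove everything up to the first comma, plus the whitespace that follows it
titles <- sub(".*?,\\s*", "", namelist)
# Entries without a comma are left unchanged by sub, so they have no title
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )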
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III
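To see why the nested sub call gets the last name right, it can help to look at the intermediate result on its own. This is only a minimal sketch on the same sample data; name_only is a variable name introduced here for illustration and is not part of the code above.

```r
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")

# Step 1: drop everything from the first comma onward.
# Entries without a comma are returned unchanged.
name_only <- sub(",.*", "", namelist)
name_only
# [1] "Mark M. Owens"  "Dale C. Good"   "Lara T. Kraft"  "Roland G. Bass"

# Step 2: with the titles gone, the last name is the final word.
lastnames <- sub(".*?(\\w+)$", "\\1", name_only, perl = TRUE)
lastnames
# [1] "Owens" "Good"  "Kraft" "Bass"
```

Because the titles are stripped before the last word is extracted, the number of commas or periods in the titles no longer affects the result.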
Upvotes: 1
Reputation: 1082
This should do the trick, at least on test data:
# Split each entry at the commas: the first piece is the name, the rest are titles
x=strsplit(namelist,split = ",")
# Strip the leading space left behind after each comma
x=lapply(x,function(y) sub(pattern = "^ ",replacement = "",x = y))
names=sapply(x,function(y) y[1])
# Join the remaining pieces into one title string ("" when there are none)
titles=sapply(x,function(y) paste(y[-1],collapse = ","))
# Split the name on spaces: the first word is the first name, the last word the last name
names=strsplit(names,split=" ")
firstnames=sapply(names,function(y) y[1])
lastnames=sapply(names,function(y) y[length(y)])
names <- data.frame(firstnames, lastnames, titles )
names
In cases like this, where every string has the same structure, it is easier to use functions like strsplit() to extract the desired parts.
Upvotes: 1