IRNotSmart

Reputation: 371

Formatting unusual (First and last Name) character strings in R

My character strings look like the following:

MLB$Name[1:6]
[1] "Wil Myers"   "Cory Spangenberg*"   "Alexei Ramirez #"   "Yangervis Solarte# (15-day dl)"   "Melvin Upton Jr."   "Travis d'Arnaud"

As you can see, these strings contain parentheses (), asterisks *, and other unusual characters (#, the apostrophe in d'Arnaud). I'm scraping these from a baseball website, and they aren't coming out in a usable form. All I want to capture is the first and last name, with the first name abbreviated to its initial (followed by a period), then the last name. I don't want any unusual characters, or suffixes such as Jr. or (15-day dl), after the names.

I want my strings to look like this:

MLB$NameFormatted[1:6]
[1] "W. Myers"   "C. Spangenberg"   "A. Ramirez"   "Y. Solarte"   "M. Upton"               "T. d'Arnaud"

A previous question of mine got an answer that successfully formats strings containing only a first and last name into the form shown above. However, the extra characters and annotations such as *, #, and (15-day dl) cause problems for that solution, as expected. The following code was used to format the strings that contain only first and last names:

sub("^(.)\\S+(\\s+.*)$", "\\1.\\2", MLB$Names)

I would really appreciate your help - I'm new to R and I'm trying to do some really interesting things with baseball statistics. Thank you for your time!

Upvotes: 0

Views: 96

Answers (1)

webb

Reputation: 4340

This does that:

MLB$NameFormatted = sub("([A-Za-z])[A-Za-z']* ([A-Za-z' -]+[A-Za-z]+).*",'\\1. \\2', MLB$Name)

...as well as correctly handling troublemakers such as "Ryan Rowland-Smith" and "Valerio de los Santos"

Sample output:

[1] "W. Myers" "C. Spangenberg" "A. Ramirez " "Y. Solarte" "M. Upton Jr"
[6] "T. d'Arnaud" "R. Rowland-Smith" "V. de los Santos"
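
Note that the sample output above still shows "M. Upton Jr", whereas the question asks for "M. Upton". One possible way to handle that is to clean the string in several small steps before abbreviating the first name. The following is only a sketch under stated assumptions: `format_name` is a hypothetical helper (it is not part of the answer above), the suffix list (Jr, Sr, II, III, IV) is a guess at what the scraped data may contain, and `trimws()` requires R 3.2.0 or later.

```r
# Hypothetical helper, not from the original answer: clean scraped player names
# step by step, then abbreviate the first name to an initial.
format_name <- function(x) {
  # 1. Drop parenthesised notes such as "(15-day dl)".
  x <- gsub("\\s*\\([^)]*\\)", "", x)
  # 2. Drop marker characters such as * and #.
  x <- gsub("[*#]", "", x)
  # 3. Drop a trailing suffix (assumed list: Jr, Sr, II, III, IV), with or
  #    without a final period.
  x <- sub(",?\\s+(Jr|Sr|II|III|IV)\\.?\\s*$", "", x)
  # 4. Trim any whitespace left over at either end.
  x <- trimws(x)
  # 5. Replace the first word with its first character plus a period.
  sub("^(\\S)\\S*\\s+", "\\1. ", x)
}

players <- c("Wil Myers", "Cory Spangenberg*", "Alexei Ramirez #",
             "Yangervis Solarte# (15-day dl)", "Melvin Upton Jr.",
             "Travis d'Arnaud", "Ryan Rowland-Smith", "Valerio de los Santos")

format_name(players)
```

Because everything after the first word is kept once the markers and suffix are removed, hyphenated and multi-word surnames such as "Rowland-Smith" and "de los Santos" pass through unchanged. The trade-off is that any marker or suffix not covered in steps 2 and 3 would survive, so those patterns may need extending for the real data.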

Upvotes: 1
