Reputation: 81
I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.
I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:
# creates a split name matrix for each record
namesplit <- strsplit(crm$DEF_NAME, ',')
# takes the first/left part of matrix, after the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))
# takes the last/right part of the matrix, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))
But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:
> test$LAST_NAME
[1] "DEWITT. B" "TAOY. PETER" "ZULLO. JASON"
[4] "LAWLOR. JOSEPH" "CRAWFORD. ADAM" "HILL. ROBERT W."
[7] "TAGERT. CHRISTOPHER" "ROSEBERY. SCOTT W." "PAYNE. ALBERT"
[10] "BUNTZ. BRIAN JOHN" "COLON. PERFECTO GAUD" "DIAZ. JOSE CANO"
[13] "COLON. ERIK D." "COLON. ERIK D." "MARTINEZ. DAVID C."
[16] "DRISKELL. JASON" "JOHNSON. ALEXANDER" "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J." "FRANCO. BRANDT" "FRANCO. BRANDT"
For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.
left = function (string,char){
substr(string,1,char)}
right = function (string, char){
substr(string,nchar(string)-(char-1),nchar(string))}
I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.
test$LAST_NAME2 <- left(test$LAST_NAME,
which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)
I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.
My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.
I fear the answer is simple. But I'm stuck. Thanks so much for your help.
Upvotes: 2
Views: 7001
Reputation: 366
I would just modify your original function namesplit
to this:
namesplit <- strsplit(crm$DEF_NAME, ',|\\.')
which will split on ,
or .
.
Also, maybe change your first name function to
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[2:length(x)]))
to catch any instances where there is a comma or period that is not in the last position.
Upvotes: 3
Reputation: 43334
With tidyr,
library(tidyr)
test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
## LAST_NAME FIRST_NAME
## 1 DEWITT B
## 2 LAWLOR JOSEPH
## 3 TAGERT CHRISTOPHER
## 4 BUNTZ BRIAN JOHN
## 5 COLON ERIK D.
## 6 DRISKELL JASON
## 7 SIPE DAVID J.
## 8 TAOY PETER
## 9 CRAWFORD ADAM
## 10 ROSEBERY SCOTT W.
## 11 COLON PERFECTO GAUD
## 12 COLON ERIK D.
## 13 JOHNSON ALEXANDER
## 14 FRANCO BRANDT
## 15 ZULLO JASON
## 16 HILL ROBERT W.
## 17 PAYNE ALBERT
## 18 DIAZ JOSE CANO
## 19 MARTINEZ DAVID C.
## 20 JACKSON RONNIE WAYNE
## 21 FRANCO BRANDT
Data
test <- structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER",
"BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.",
"TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD",
"COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON",
"HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.",
"JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA,
-21L), class = "data.frame", .Names = "LAST_NAME")
Upvotes: 1