Jason Grotto
Jason Grotto

Reputation: 81

R: find position of specific character in data frame column

I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.

I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:

# creates a split name matrix for each record
namesplit <- strsplit(crm$DEF_NAME, ',')

# takes the first/left part of matrix, after the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))

# takes the last/right part of the matrix, after the comma 
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)])) 

But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:

> test$LAST_NAME
 [1] "DEWITT. B"             "TAOY. PETER"           "ZULLO. JASON"         
 [4] "LAWLOR. JOSEPH"        "CRAWFORD. ADAM"        "HILL. ROBERT W."      
 [7] "TAGERT. CHRISTOPHER"   "ROSEBERY. SCOTT W."    "PAYNE. ALBERT"        
[10] "BUNTZ. BRIAN JOHN"     "COLON. PERFECTO GAUD"  "DIAZ. JOSE CANO"      
[13] "COLON. ERIK D."        "COLON. ERIK D."        "MARTINEZ. DAVID C."   
[16] "DRISKELL. JASON"       "JOHNSON. ALEXANDER"    "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J."        "FRANCO. BRANDT"        "FRANCO. BRANDT"  

For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.

left = function (string,char){
  substr(string,1,char)}

right = function (string, char){
  substr(string,nchar(string)-(char-1),nchar(string))}

I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.

test$LAST_NAME2 <- left(test$LAST_NAME, 
   which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)

I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.

My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.

I fear the answer is simple. But I'm stuck. Thanks so much for your help.

Upvotes: 2

Views: 7001

Answers (2)

Jacob F
Jacob F

Reputation: 366

I would just modify your original function namesplit to this:

 namesplit <- strsplit(crm$DEF_NAME, ',|\\.')

which will split on , or ..

Also, maybe change your first name function to

crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[2:length(x)]))

to catch any instances where there is a comma or period that is not in the last position.

Upvotes: 3

alistaire
alistaire

Reputation: 43334

With tidyr,

library(tidyr)

test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
##    LAST_NAME    FIRST_NAME
## 1     DEWITT             B
## 2     LAWLOR        JOSEPH
## 3     TAGERT   CHRISTOPHER
## 4      BUNTZ    BRIAN JOHN
## 5      COLON       ERIK D.
## 6   DRISKELL         JASON
## 7       SIPE      DAVID J.
## 8       TAOY         PETER
## 9   CRAWFORD          ADAM
## 10  ROSEBERY      SCOTT W.
## 11     COLON PERFECTO GAUD
## 12     COLON       ERIK D.
## 13   JOHNSON     ALEXANDER
## 14    FRANCO        BRANDT
## 15     ZULLO         JASON
## 16      HILL     ROBERT W.
## 17     PAYNE        ALBERT
## 18      DIAZ     JOSE CANO
## 19  MARTINEZ      DAVID C.
## 20   JACKSON  RONNIE WAYNE
## 21    FRANCO        BRANDT

Data

test <-  structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER", 
    "BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.", 
    "TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD", 
    "COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON", 
    "HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.", 
    "JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA, 
    -21L), class = "data.frame", .Names = "LAST_NAME")

Upvotes: 1

Related Questions