Andy Stein

Reputation: 471

R - mutate for string processing - not getting the behavior I was hoping for

I'm trying to use mutate in dplyr to process strings, but I'm not getting the output I want (see below): instead of operating row by row, mutate takes the result for the first element and fills it down the whole column. Could someone help me understand what I'm doing wrong and how to change this code so it works properly?

short.idfun = function(longid) 
{
    x      = strsplit(longid,"_")
    y      = x[[1]]
    study  = substr(y[1],8,nchar(y[1]))
    subj   = y[length(y)]
    subj   = substr(subj,regexpr("[^0]",subj),nchar(subj)) #remove leading zeros
    shortid= paste(study,subj,sep="-")
    return(shortid)
}

library(dplyr)
data = data.frame(test=c("1234567Andy_003_003003","1234567Beth_004_003004","1234567Char_003_003005"),stringsAsFactors=FALSE)
data = mutate(data,shortid=short.idfun(test))
print(data)

#### Below is my output
#                       test   shortid
#1    1234567Andy_003_003003 Andy-3003
#2    1234567Beth_004_003004 Andy-3003
#3    1234567Char_003_003005 Andy-3003

#### This is the behavior I was hoping for
#                       test   shortid
#1    1234567Andy_003_003003 Andy-3003
#2    1234567Beth_004_003004 Beth-3004
#3    1234567Char_003_003005 Char-3005

Upvotes: 3

Views: 270

Answers (2)

Steven Beaupré

Reputation: 21621

Another alternative is the use of rowwise():

data %>%
  rowwise() %>% 
  mutate(shortid = short.idfun(test))

Which gives:

#Source: local data frame [3 x 2]
#Groups: <by row>
#
#                    test   shortid
#                   (chr)     (chr)
#1 1234567Andy_003_003003 Andy-3003
#2 1234567Beth_004_003004 Beth-3004
#3 1234567Char_003_003005 Char-3005

Upvotes: 1

Benjamin

Reputation: 17369

The issue is that your function needs a little help vectorizing. You can run it through vapply to get what you're after.

data = data.frame(test=c("1234567Andy_003_003003","1234567Beth_004_003004","1234567Char_003_003005"),stringsAsFactors=FALSE)
data= mutate(data,
             shortid=vapply(test, short.idfun, character(1)))
print(data)
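As a quick base-R check of the point above (no dplyr needed), the sketch below calls the question's function once on the whole column and then once per element through vapply. The variable name ids is only illustrative; the function body is copied from the question.

```r
# short.idfun exactly as posted in the question
short.idfun = function(longid)
{
    x      = strsplit(longid, "_")
    y      = x[[1]]                                   # only the first string's pieces
    study  = substr(y[1], 8, nchar(y[1]))
    subj   = y[length(y)]
    subj   = substr(subj, regexpr("[^0]", subj), nchar(subj))  # remove leading zeros
    shortid = paste(study, subj, sep = "-")
    return(shortid)
}

# Illustrative input: the same three strings as the question's test column
ids = c("1234567Andy_003_003003", "1234567Beth_004_003004", "1234567Char_003_003005")

# Called on the whole vector, the function returns a single string,
# which mutate then recycles down the column
whole = short.idfun(ids)
print(whole)            # "Andy-3003"

# vapply calls the function once per element and checks that each
# result is a single character string
each = unname(vapply(ids, short.idfun, character(1)))
print(each)             # "Andy-3003" "Beth-3004" "Char-3005"
```

The unname() call only drops the names that vapply attaches to a character input; it does not change the values.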

To see why you got the result you did, we can step through the first few lines of your function.

longid = data$test
(x <- strsplit(longid, "_"))
[[1]]
[1] "1234567Andy" "003"         "003003"     

[[2]]
[1] "1234567Beth" "004"         "003004"     

[[3]]
[1] "1234567Char" "003"         "003005" 

Everything looks good so far, but now you define y.

(y      = x[[1]])

[1] "1234567Andy" "003"         "003003" 

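By calling x[[1]], you pulled out only the first element of the list x, that is, the split pieces of the first string. The pieces of the other strings are never used, so the function returns a single value and mutate recycles it down the column. You could also revise your function to work on every element of x, for example y <- vapply(x, function(v) v[1], character(1)) for the first piece of each string (and v[length(v)] in the same way for the last piece), and skip the vapply in mutate. Either way should work.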
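The second option, revising the function itself, might look like the sketch below. The name short_id_vec and the intermediate variable names are hypothetical, not from the question; the steps mirror the original function but are applied to every element of the strsplit result.

```r
# Hypothetical vectorized rewrite of short.idfun (the name is illustrative)
short_id_vec = function(longid)
{
    parts = strsplit(longid, "_")                     # list: one character vector per input string
    first = vapply(parts, function(v) v[1], character(1))          # first piece of each string
    last  = vapply(parts, function(v) v[length(v)], character(1))  # last piece of each string
    study = substr(first, 8, nchar(first))            # drop the 7-character prefix
    subj  = substr(last, regexpr("[^0]", last), nchar(last))       # remove leading zeros
    paste(study, subj, sep = "-")
}

ids = c("1234567Andy_003_003003", "1234567Beth_004_003004", "1234567Char_003_003005")
result = short_id_vec(ids)
print(result)           # "Andy-3003" "Beth-3004" "Char-3005"
```

Because substr, regexpr, nchar, and paste are all vectorized, this version returns one value per input string, so it can be used directly as mutate(data, shortid = short_id_vec(test)) without rowwise() or an outer vapply.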

Upvotes: 0
