Jibril

Reputation: 1037

R - Update Values within Matrix while using Parallel code ( doParallel )

I have a matrix of many million values. One column is a weirdly formatted date, which I am converting to an actual datetime that I can sort.

I want to speed this up and do it in parallel. I've had success doing minor things before in Parallel, but that was easy because I wasn't actively changing an existing matrix.

How do I do this in parallel? I can't seem to figure it out...

The code I want to parallelize is...

len = dim(combinedDF)[1]
for(j in 1:len)
{
    sendTime = combinedDF[j, "tweetSendTime"]
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
    combinedDF[j, "tweetSendTime"] = toString(updatedTime)
}

EDIT : I was told to also try apply. I tried...

len = dim(combinedDF)[1]
### Using apply
apply(combinedDF,1, function(combinedDF,y){
sendTime = combinedDF[y, "tweetSendTime"]
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[y, "tweetSendTime"] = toString(updatedTime)
combinedDF[y,]
}, y=1:len)

However, that throws an error once the closing }, line is processed: "Error in combinedDF[y, "tweetSendTime"] : incorrect number of dimensions".

Edit :

updateTime = function(timeList){
sendTime = timeList
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
toString(updatedTime)
} 


apply(as.matrix(combinedDF[,"tweetSendTime"]),1,updateTime)

Seems to work

Upvotes: 1

Views: 434

Answers (2)

Steve Weston

Reputation: 19677

Since you're just modifying a single column of combinedDF, and gsub and strptime are vector functions, you don't need to use a loop or any kind of "apply" function:

sendTime <- gsub(" 0000", " +0000", combinedDF[, "tweetSendTime"])
updatedTime <- strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[, "tweetSendTime"] <- as.character(updatedTime)

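Note that I used as.character since it is a vector function that returns one string per element, while toString is not: it collapses the whole vector into a single comma-separated string.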
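To illustrate the difference, here is a small sketch. The two sample timestamps are made up for illustration, and both tz = "UTC" and an English locale (needed for %a and %b) are assumptions, not something stated in the question.

```r
# Hypothetical sample values in the format the question describes
raw <- c("Mon Jan 05 10:15:30 0000 2015",
         "Tue Jan 06 23:59:59 0000 2015")

# Same two steps as above: fix the offset, then parse the whole vector at once
fixed  <- gsub(" 0000", " +0000", raw)
parsed <- strptime(fixed, "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

per_element <- as.character(parsed)  # one string per timestamp
collapsed   <- toString(parsed)      # everything joined into a single string

length(per_element)
length(collapsed)
```

Only the as.character result has the same length as the input, which is what assigning back into a column requires.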

Upvotes: 1

David Go

Reputation: 840

I usually use doParallel for parallel execution:

library(doParallel)
ClusterCount = 2 # depends on the threads you want to use
cl <- makeCluster(ClusterCount)
registerDoParallel(cl)
len = dim(combinedDF)[1]
combinedDF <- foreach(j = 1:len,.combine = rbind) %dopar% {
    sendTime = combinedDF[j, "tweetSendTime"]
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
    combinedDF[j, "tweetSendTime"] = toString(updatedTime)
    combinedDF[j,]
}
stopCluster(cl)

However, it should be mentioned that what you are doing does not seem to be computationally expensive; it just requires many iterations. You should consider rewriting your code: loops are not very fast in R, and an apply()-based approach should speed up your code more than a parallel one.
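If parallel execution is still wanted, one option is to hand each worker a large chunk of the column rather than a single row, so that the per-task overhead is paid only a few times. The following is a minimal sketch under assumptions: the sample data frame, the worker count of 2, tz = "UTC" and an English locale are all made up for illustration, and whether this beats plain vectorized code on real data is untested.

```r
library(doParallel)  # also attaches foreach and parallel

# Hypothetical sample data standing in for combinedDF
combinedDF <- data.frame(
  tweetSendTime = rep("Mon Jan 05 10:15:30 0000 2015", 1000),
  stringsAsFactors = FALSE
)

workers <- 2  # assumed worker count
cl <- makeCluster(workers)
registerDoParallel(cl)

times <- combinedDF[, "tweetSendTime"]

# Split the column into `workers` contiguous chunks, preserving order
chunks <- split(times, cut(seq_along(times), workers, labels = FALSE))

# Each worker converts a whole chunk with vectorized calls;
# the results are concatenated back in the original order
converted <- foreach(chunk = chunks, .combine = c) %dopar% {
  fixed <- gsub(" 0000", " +0000", chunk)
  as.character(strptime(fixed, "%a %b %d %H:%M:%S %z %Y", tz = "UTC"))
}

stopCluster(cl)

combinedDF[, "tweetSendTime"] <- converted
```

Because each task only receives its own chunk instead of the full data frame, far less data is copied to the workers than in the row-by-row version.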

Upvotes: 0
