I have a matrix with many millions of values. One column is a weirdly formatted date, which I am converting to an actual datetime that I can sort.
I want to speed this up and do it in parallel. I've had success doing minor things in parallel before, but that was easy because I wasn't actively changing an existing matrix.
How do I do this in parallel? I can't seem to figure it out...
The code I want to parallelize is...
len = dim(combinedDF)[1]
for(j in 1:len)
{
sendTime = combinedDF[j, "tweetSendTime"]
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[j, "tweetSendTime"] = toString(updatedTime)
}
EDIT: I was told to also try apply. I tried...
len = dim(combinedDF)[1]
### Using apply
apply(combinedDF,1, function(combinedDF,y){
sendTime = combinedDF[y, "tweetSendTime"]
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[y, "tweetSendTime"] = toString(updatedTime)
combinedDF[y,]
}, y=1:len)
However, that throws an error when the closing }, is processed: "Error in combinedDF[y, "tweetSendTime"] : incorrect number of dimensions".
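The root cause can be seen on a toy data frame (the sample values below are made up): apply() over rows coerces the data frame to a character matrix and hands the function each row as a plain 1-D named vector, which has no dimensions, so two-dimensional indexing like combinedDF[y, "tweetSendTime"] inside the function fails.

```r
# apply() passes each row as a 1-D named character vector, so indexing
# it with [i, j] raises "incorrect number of dimensions".
df <- data.frame(id = 1:2,
                 tweetSendTime = c("Mon Jan 06 14:30:00 0000 2014",
                                   "Tue Jan 07 09:15:30 0000 2014"),
                 stringsAsFactors = FALSE)

rowClasses <- apply(df, 1, function(r) class(r))
# every row arrives as "character"; r["tweetSendTime"] (1-D name lookup)
# would work, but r[1, "tweetSendTime"] errors out.
failed <- try(apply(df, 1, function(r) r[1, "tweetSendTime"]),
              silent = TRUE)
```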
EDIT:
updateTime = function(timeList){
sendTime = timeList
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
toString(updatedTime)
}
apply(as.matrix(combinedDF[,"tweetSendTime"]),1,updateTime)
This seems to work.
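As a sanity check, here is the same updateTime function run on a couple of made-up sample strings (the values are assumptions, not real tweet data, and parsing the %a/%b fields assumes an English locale):

```r
# Same conversion as above, demonstrated on toy input.
updateTime <- function(timeList) {
  sendTime <- gsub(" 0000", " +0000", timeList)
  updatedTime <- strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
  toString(updatedTime)
}

sampleTimes <- c("Mon Jan 06 14:30:00 0000 2014",
                 "Tue Jan 07 09:15:30 0000 2014")
converted <- apply(as.matrix(sampleTimes), 1, updateTime)
# converted is one character datetime per input row
```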
Since you're just modifying a single column of combinedDF, and gsub and strptime are vectorized functions, you don't need to use a loop or any kind of "apply" function:
sendTime <- gsub(" 0000", " +0000", combinedDF[, "tweetSendTime"])
updatedTime <- strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[, "tweetSendTime"] <- as.character(updatedTime)
Note that I used as.character since it is vectorized, while toString is not: it collapses the whole vector into a single comma-separated string.
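A minimal, self-contained sketch of this vectorized version (the data frame and its values are made up for illustration; parsing the %a/%b fields assumes an English locale):

```r
# Toy stand-in for the real combinedDF.
combinedDF <- data.frame(
  tweetSendTime = c("Mon Jan 06 14:30:00 0000 2014",
                    "Tue Jan 07 09:15:30 0000 2014"),
  stringsAsFactors = FALSE
)

# gsub and strptime both operate on the whole column at once -- no loop.
sendTime    <- gsub(" 0000", " +0000", combinedDF[, "tweetSendTime"])
updatedTime <- strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[, "tweetSendTime"] <- as.character(updatedTime)
```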
I usually use doParallel for parallel execution:
library(doParallel)
ClusterCount = 2 # depends on the threads you want to use
cl <- makeCluster(ClusterCount)
registerDoParallel(cl)
len = dim(combinedDF)[1]
combinedDF <- foreach(j = 1:len,.combine = rbind) %dopar% {
sendTime = combinedDF[j, "tweetSendTime"]
sendTime = gsub(" 0000", " +0000", sendTime)
updatedTime = strptime( sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[j, "tweetSendTime"] = toString(updatedTime)
combinedDF[j,]
}
stopCluster(cl)
However, it should be mentioned that what you are doing does not seem to be computationally expensive; it just requires many iterations. You should consider rewriting your code, as loops are not very fast in R, and an apply()-based approach should speed up your code more than a parallel one.
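If the table really were big enough for parallelism to pay off, one task per row would be dominated by scheduling overhead; a chunked sketch using the base parallel package instead of doParallel (toy data, 2 workers, and an English locale are all assumptions here) sends each worker one vectorized slice of the column:

```r
library(parallel)

# Toy stand-in for the real tweetSendTime column (values are made up).
tweetSendTime <- rep(c("Mon Jan 06 14:30:00 0000 2014",
                       "Tue Jan 07 09:15:30 0000 2014"), 50)

cl <- makeCluster(2)

# One chunk per worker; each worker runs the same vectorized
# gsub/strptime on its slice, and the slices are re-joined in order.
chunks <- split(tweetSendTime,
                cut(seq_along(tweetSendTime), 2, labels = FALSE))
converted <- parLapply(cl, chunks, function(chunk) {
  sendTime <- gsub(" 0000", " +0000", chunk)
  as.character(strptime(sendTime, "%a %b %d %H:%M:%S %z %Y"))
})
tweetSendTime <- unlist(converted, use.names = FALSE)

stopCluster(cl)
```

On data this small the cluster startup costs far more than the work itself; the chunking would only pay off for the many-million-row case in the question.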