Reputation: 612
I am just getting familiar with MongoDB, which is why I did something stupid. Each of my dataset's entries includes a timestamp (they're Tweets). Instead of converting the timestamp from a string to an actual date format before inserting, I inserted it as a plain string.
Now my dataset is getting huge (3+ million Tweets) and I want to start sorting/ranging my entries. Since my timestamp is still a string ("Wed Apr 29 09:52:22 +0000 2015"), I want to convert it to a proper date type.
I found the following code in this answer: How do I convert a property in MongoDB from text to date type?
var cursor = db.ClockTime.find()
while (cursor.hasNext()) {
    var doc = cursor.next();
    db.ClockTime.update({_id : doc._id}, {$set : {ClockInTime : new Date(doc.ClockInTime)}})
}
And it works. However, it is incredibly slow: according to the MongoHub app it processes only about 4 queries per second, so converting 3+ million Tweets will take approximately 8.6 days. I really hope there is a way to speed this up, as my deadline is in 8 days :P
Any thoughts?
Upvotes: 4
Views: 8338
Reputation: 20722
Another option would be to use bulk operations, which are extremely fast, especially the unordered variant, since they can be applied in parallel.
var bulk = db.ClockTime.initializeUnorderedBulkOp();
var myDocs = db.ClockTime.find();
var ops = 0;

myDocs.forEach(function (myDoc) {
    bulk.find(
        { _id: myDoc._id }
    ).updateOne(
        { $set: { ClockInTime: new Date(myDoc.ClockInTime) } }
    );
    // Send the batch to the server every 10,000 operations and start a fresh one.
    if ((++ops % 10000) === 0) {
        bulk.execute();
        bulk = db.ClockTime.initializeUnorderedBulkOp();
    }
});

// Flush the remaining operations (execute() throws on an empty bulk).
if (ops % 10000 !== 0) {
    bulk.execute();
}
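On MongoDB 3.2+ the same batching pattern can also be written with db.collection.bulkWrite(), which supersedes the Bulk() API in newer shells. A minimal sketch of the same conversion, assuming the collection and field names from above:

var batch = [];
db.ClockTime.find().forEach(function (doc) {
    batch.push({
        updateOne: {
            filter: { _id: doc._id },
            update: { $set: { ClockInTime: new Date(doc.ClockInTime) } }
        }
    });
    // Flush every 10,000 operations; ordered:false lets the server apply them in parallel.
    if (batch.length === 10000) {
        db.ClockTime.bulkWrite(batch, { ordered: false });
        batch = [];
    }
});
if (batch.length > 0) {
    db.ClockTime.bulkWrite(batch, { ordered: false });
}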
Upvotes: 10
Reputation: 69773
By default, updates block until the database has sent back an acknowledgment that it performed the update successfully. When you run a mongo shell on your local workstation and connect to a remote database, each update therefore takes at least as long as your ping to the database.
If you are allowed to do so, you could SSH into the database server (the primary for a replica set) and run the script there, as shown below. This reduces the network latency to almost zero. With a cluster the result would likely still be an improvement, but not as much, because you have to log into a mongos server, which still needs to wait for acknowledgments from the replica set(s) it routes your updates to.
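Concretely, with the conversion loop saved to a file, that could look like this (host, database, and file names are placeholders):

ssh admin@mongo-primary.example.com
mongo twitterdb convert_clocktime.js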
Another option is to perform the updates with write concern 0 (unacknowledged). Program execution then continues immediately, which improves speed drastically. But keep in mind that any errors are silently ignored this way.
db.ClockTime.update(
    { _id: doc._id },
    { $set: { ClockInTime: new Date(doc.ClockInTime) } },
    { writeConcern: { w: 0 } }
)
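For context, this is the same loop as in the question, just with the unacknowledged write concern added:

var cursor = db.ClockTime.find();
while (cursor.hasNext()) {
    var doc = cursor.next();
    // w: 0 = fire-and-forget; the shell does not wait for the server's reply
    db.ClockTime.update(
        { _id: doc._id },
        { $set: { ClockInTime: new Date(doc.ClockInTime) } },
        { writeConcern: { w: 0 } }
    );
}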
A third option, which would be even faster, is to use mongoexport to dump the whole collection to a file in JSON format, convert it with a local script, and then use mongoimport to re-import the converted data. The drawback is that this requires a short downtime between export and import, because any writes arriving in between would be lost.
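A rough sketch of that pipeline (the database name twitterdb is a placeholder; adjust to yours):

mongoexport -d twitterdb -c ClockTime -o clocktime.json
node convert.js < clocktime.json > clocktime_converted.json
mongoimport -d twitterdb -c ClockTime --drop --file clocktime_converted.json

where convert.js could be a small Node.js script that rewrites the string field into Extended JSON's $date form, so mongoimport stores it as a real date:

// convert.js: mongoexport emits one JSON document per line
const readline = require('readline');
readline.createInterface({ input: process.stdin }).on('line', function (line) {
    var doc = JSON.parse(line);
    // new Date() parses Twitter's "Wed Apr 29 09:52:22 +0000 2015" format
    doc.ClockInTime = { $date: new Date(doc.ClockInTime).toISOString() };
    console.log(JSON.stringify(doc));
});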
Upvotes: 6