Reputation: 381
Update on update:
Solved! See this: MongoDB: cannot iterate through all data with cursor (because data is corrupted)
It was caused by a corrupted data set, not by MongoDB or the driver.
I'm using the latest Java driver (2.11.3) for MongoDB (2.4.6). I've got a collection with ~250M records and I want to use a cursor to iterate through all of them. However, after 10 minutes or so I get either a false cursor.hasNext(), or an exception saying that the cursor does not exist on the server.
After that I learned about cursor timeouts and wrapped my cursor.next() in a try/catch. If it threw an exception, or if hasNext() returned false before all the records were iterated, the program closed the cursor, allocated a new one, and used skip() to jump right back to where it left off.
But later on I read about cursor.skip() performance issues... and sure enough, once the program reached ~20M records, cursor.next() after cursor.skip() threw a java.util.NoSuchElementException. I believe the skip operation timed out, which invalidated the cursor.
Yes, I've read about skip() performance issues and cursor timeout issues... but now I seem to be stuck in a dilemma where fixing one breaks the other.
So, is there a way to gracefully iterate through all the data in a huge dataset?
@mnemosyn has already pointed out that I have to rely on range-based queries. But the problem is that I want to split all the data into 16 parts and process them on different machines, and the data is not uniformly distributed over any monotonic key space. If I want the load balanced, I need a way to count how many keys fall in a given range and balance them accordingly. My goal is to partition the data into 16 parts, so I have to find the quartiles of quartiles (sorry, I don't know if there's a proper mathematical term for this) of the keys and use them as split points.
Is there a way to achieve this?
I do have some ideas for once the partition boundary keys are obtained: if a cursor times out again, I can simply record the latest tweetID and resume with a new range query. But the range query itself has to be fast enough, or I'll still hit timeouts. I'm not confident about this...
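For what it's worth, the split points could be estimated from a sorted sample of the keys. A minimal sketch in plain Java, no driver involved (SplitPoints and boundaries are made-up names, and the sorted key list stands in for a sample you'd pull from MongoDB, e.g. by scanning only the indexed field):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive the interior boundary keys that split a sorted key
// sample into `parts` roughly equal ranges ("quartiles of quartiles"
// when parts == 16).
public class SplitPoints {
    public static List<Long> boundaries(List<Long> sortedKeys, int parts) {
        List<Long> bounds = new ArrayList<>();
        for (int i = 1; i < parts; i++) {
            // index of the i-th of `parts` quantiles in the sample
            int idx = (int) ((long) sortedKeys.size() * i / parts);
            bounds.add(sortedKeys.get(idx));
        }
        return bounds;
    }
}
```

Worker k would then handle the half-open key range between boundary k-1 and boundary k; how well this balances depends entirely on how representative the sample is.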
Update:
Problem solved! I hadn't realised that I don't have to partition the data into contiguous chunks at all; a round-robin job dispatcher will do. See the comments on the accepted answer.
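For anyone curious, the round-robin idea is easy to sketch (names are made up; real batches would come from the range queries described in the answer):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of round-robin dispatch: one reader produces batches and
// deals them out to workers in turn, so no up-front partitioning of the
// key space is needed and the load balances itself.
public class RoundRobinDispatcher {
    public static <T> List<List<T>> dispatch(List<T> batches, int workers) {
        List<List<T>> queues = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            queues.add(new ArrayList<T>());
        }
        for (int i = 0; i < batches.size(); i++) {
            queues.get(i % workers).add(batches.get(i)); // batch i -> worker i % workers
        }
        return queues;
    }
}
```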
Upvotes: 4
Views: 3135
Reputation: 46291
In general, yes. If you have a monotonic field, ideally an indexed field, you can simply walk along that. For instance, if you're using fields of type ObjectId as the primary key, or if you have a CreatedDate or something similar, you can simply use an $lt query, take a fixed number of elements, then query again using $lt with the smallest _id or CreatedDate you encountered in the previous batch.
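The loop can be sketched without a database at all. Below, a TreeSet stands in for the indexed _id values, and each pass mimics find({_id: {$lt: last}}).sort({_id: -1}).limit(n); class and method names are made up:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch of $lt batching: each round fetches the `limit` largest keys
// strictly below the smallest key of the previous batch.
public class LtBatching {
    public static List<Long> iterateAll(TreeSet<Long> ids, int limit) {
        List<Long> seen = new ArrayList<>();
        Long last = null;
        while (true) {
            // Stand-in for find({_id: {$lt: last}}).sort({_id: -1}).limit(limit)
            NavigableSet<Long> below = (last == null) ? ids : ids.headSet(last, false);
            List<Long> batch = new ArrayList<>();
            Iterator<Long> it = below.descendingIterator();
            while (it.hasNext() && batch.size() < limit) {
                batch.add(it.next());
            }
            if (batch.isEmpty()) {
                break; // nothing left below `last`: done
            }
            seen.addAll(batch);                  // process the batch here
            last = batch.get(batch.size() - 1);  // smallest key in this batch
        }
        return seen;
    }
}
```

Each round is a fresh, short-lived cursor over an indexed range, so the server-side cursor timeout never comes into play.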
Be careful about strict vs. non-strict monotonic behavior: you might have to use $lte if the keys aren't strictly monotonic, then make sure you don't process the duplicates twice. Since the _id field is unique, ObjectIds are always strictly monotonic.
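The non-strict case can be sketched the same way. Here a list of {key, id} pairs, sorted by key descending, stands in for find({created: {$lte: last}}).sort({created: -1}).limit(n), and the duplicate check happens client-side; names are made up:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of $lte batching over a non-strict (duplicate-bearing) sort key.
// Each doc is a long[]{key, id}; docs must be sorted by key descending.
// Because the boundary key can repeat across batches, ids that were
// already processed are filtered out before they count toward the limit.
public class LteBatching {
    public static List<Long> iterateAll(List<long[]> docs, int limit) {
        List<Long> order = new ArrayList<>();
        Set<Long> seenIds = new HashSet<>();
        Long lastKey = null;
        while (true) {
            List<long[]> batch = new ArrayList<>();
            for (long[] d : docs) {                    // simulated $lte range scan
                if (lastKey != null && d[0] > lastKey) continue;
                if (seenIds.contains(d[1])) continue;  // skip boundary dupes
                batch.add(d);
                if (batch.size() == limit) break;
            }
            if (batch.isEmpty()) {
                break; // everything at or below lastKey is processed
            }
            for (long[] d : batch) {
                seenIds.add(d[1]);
                order.add(d[1]);                       // "process" the document
            }
            lastKey = batch.get(batch.size() - 1)[0];  // smallest key in batch
        }
        return order;
    }
}
```

One caveat this sketch glosses over: if more than `limit` documents share a single key value, a naive $lte loop would keep fetching the same batch forever, so the duplicate filtering isn't optional.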
If you don't have such a key, things are a little trickier. You can still iterate along an index (whatever index, be it a name, a hash, a UUID, a GUID, etc.). That works just as well, but it's hard to get a snapshot, because you can never tell whether a result you just found was inserted before or after you started the traversal. Also, documents inserted into a part of the key range you've already passed will be missed.
Upvotes: 1