Iterate over all documents in solr

Question

I have a Solr instance with 200M+ documents, and I would like to find an efficient way to iterate over all of them.

I tried using the start parameter to formulate a list of queries:

http://ip:port/solr/docs/select?q=*:*&start=0&rows=1000000&fl=content&wt=python

http://ip:port/solr/docs/select?q=*:*&start=1000000&rows=1000000&fl=content&wt=python

...

But this becomes very slow once start gets large.

I also tried using the cursorMark parameter with an initial query like this one:

http://ip:port/solr/docs/select?q=*:*&cursorMark=*&sort=id+asc&start=0&rows=1000000&fl=content&wt=python

which I believe tries to sort all the documents first and crashes the server. Sadly, I don't think it is possible to bypass the sort. What would be the proper way to do this?

user3091275 · Accepted Answer

Okay, so I couldn't make it work with the cursor, though that is probably because I don't know the tool well enough. If you are having the same problem, here are three tracks to try:

Track one: use a cursor sorted on _docid_, as suggested by @femtoRgon. I couldn't make it work, but I didn't have much time to spend on it.
Track two: use the export handler, as suggested by @Persimmonium.
Track three (lazy track): what I did in the end was keep using incremental start values but switch from wt=python to wt=csv, which is much faster and lets me query in batches of 10M documents. This reduces the number of queries, so the cost of using start instead of cursorMark is more or less amortized.
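
For anyone who wants to give the cursor another try, here is a minimal sketch of the cursorMark loop in Python. It is not what I ended up running, so treat it as untested against a real 200M-document index. It assumes the uniqueKey field is called id (Solr requires the cursor sort to include the uniqueKey field), it leaves out the start parameter, and it uses a much smaller rows value than the 1000000 in my query above so that each response stays small. The names fetch_json and iter_docs are made up for this example, and the fetch argument only exists so the loop can be exercised without a server.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(base_url, params):
    """GET base_url with the given query parameters and parse the JSON reply."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_docs(base_url, rows=1000, fl="content", sort="id asc", fetch=fetch_json):
    """Yield every document matching *:* by following Solr's cursorMark."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": "*:*",
            "fl": fl,
            "sort": sort,          # assumption: "id" is the uniqueKey field
            "rows": rows,
            "cursorMark": cursor,  # no start parameter alongside the cursor
            "wt": "json",
        }
        data = fetch(base_url, params)
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same cursor back once there is nothing left
            break
        cursor = next_cursor
```

Usage would look something like for doc in iter_docs("http://ip:port/solr/docs/select"): ..., with the same ip:port placeholder as in the question. Because it is a generator, only one page of documents is held in memory at a time.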

Good luck, post your solutions if you find anything better.
