user3091275

Reputation: 1023

Iterate over all documents in Solr

I have a Solr instance with 200M+ documents. I would like to find an efficient way to iterate over all of them.

I tried using the start parameter to formulate a list of queries:

http://ip:port/solr/docs/select?q=*:*&start=0&rows=1000000&fl=content&wt=python

http://ip:port/solr/docs/select?q=*:*&start=1000000&rows=1000000&fl=content&wt=python

...

But this becomes very slow once start gets large.

I also tried using the cursorMark parameter with an initial query like this one:

http://ip:port/solr/docs/select?q=*:*&cursorMark=*&sort=id+asc&start=0&rows=1000000&fl=content&wt=python

which I believe tries to sort all the documents first and crashes the server. Sadly, I don't think it is possible to bypass the sort. What would be the proper way to do this?

Upvotes: 0

Views: 1709

Answers (2)

user3091275

Reputation: 1023

Okay, so I couldn't make it work with the cursor, though that is probably because I don't know the tool well enough. If you are having the same problem, here are three tracks:

  • Track one: use a cursor sorted on _docid_, as suggested by @femtoRgon. I couldn't make it work, but I didn't have a lot of time to spend on it.
  • Track two: use the export handler, as suggested by @Persimmonium.
  • Track three (lazy track): what I did in the end was keep using incremental start values but switch from wt=python to wt=csv, which is much faster and lets me query in batches of 10M documents. This reduces the number of queries, so the cost of using start instead of cursorMark is amortized over larger batches.
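Track three can be sketched roughly as follows. This is only an illustration of generating the batch URLs: the function name batch_urls, the default field list, and the host in the usage note are assumptions, and it presumes the index is not modified while the batches are being fetched.

```python
from urllib.parse import urlencode


def batch_urls(base_url, total_docs, batch_size, fl="content"):
    """Yield one select URL per batch, using start/rows paging and wt=csv.

    base_url is the select handler, e.g. http://ip:port/solr/docs/select.
    total_docs is the number of documents to cover (numFound of q=*:*).
    """
    for start in range(0, total_docs, batch_size):
        params = {
            "q": "*:*",
            "start": start,
            "rows": batch_size,
            "fl": fl,
            "wt": "csv",  # CSV output, as used in track three
        }
        yield base_url + "?" + urlencode(params)
```

For example, list(batch_urls("http://ip:port/solr/docs/select", 200_000_000, 10_000_000)) would produce 20 URLs, one per batch of 10M documents.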

Good luck, post your solutions if you find anything better.

Upvotes: 0

Persimmonium

Reputation: 15789

Deep paging with start is a very well-known antipattern. You just need to use the cursorMark feature to go deep into a result set.
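A minimal sketch of a cursorMark loop, assuming the uniqueKey field is id (as in the question's sort=id+asc) and using wt=json. The function names and the fetch parameter are made up for illustration. Solr returns a nextCursorMark with every page, and the iteration is finished when that value equals the cursorMark that was sent. A much smaller rows value than 1000000 is assumed here, since very large pages may be part of what strains the server.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(base_url, params):
    """Send one GET request to a Solr select handler and decode the JSON body."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_docs(base_url, rows=1000, sort="id asc", fl="content", fetch=fetch_json):
    """Yield every document matching *:* by following cursorMark.

    The sort must include the uniqueKey field (assumed to be "id").
    """
    cursor = "*"  # initial cursor value
    while True:
        params = {
            "q": "*:*",
            "sort": sort,
            "rows": rows,
            "fl": fl,
            "wt": "json",
            "cursorMark": cursor,  # no start parameter when using a cursor
        }
        data = fetch(base_url, params)
        for doc in data["response"]["docs"]:
            yield doc
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:
            break  # same cursor returned: the result set is exhausted
        cursor = next_cursor
```

Usage would look like: for doc in iter_docs("http://ip:port/solr/docs/select"): process(doc), where process is whatever you do with each document.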

If cursorMark is not doable, then try the export handler.
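A rough sketch of calling the export handler, under several assumptions: the collection lives at a URL like http://ip:port/solr/docs, the /export handler is enabled, and every field named in fl and sort has docValues (which the export handler requires, so an analyzed text field such as content may not be exportable as-is). The helper names and the opener parameter are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def export_url(collection_url, fl="id", sort="id asc", q="*:*"):
    """Build an /export URL; the export handler requires both sort and fl."""
    params = {"q": q, "sort": sort, "fl": fl}
    return collection_url.rstrip("/") + "/export?" + urlencode(params)


def export_docs(collection_url, opener=urllib.request.urlopen, **kwargs):
    """Fetch the full export and return the list of documents.

    This reads the whole response into memory, which is fine for a trial
    query but not for 200M documents; a streaming JSON parser would be
    needed there.
    """
    with opener(export_url(collection_url, **kwargs)) as resp:
        return json.load(resp)["response"]["docs"]
```

It is probably worth trying this first with a restrictive q to check that the fields are exportable before running it over the whole index.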


Upvotes: 2
