Reputation: 1023
I have a solr instance with 200M+ documents. I would like to find an efficient way to iterate over all those documents.
I tried using the start parameter to formulate a list of queries:
http://ip:port/solr/docs/select?q=*:*&start=0&rows=1000000&fl=content&wt=python
http://ip:port/solr/docs/select?q=*:*&start=1000000&rows=1000000&fl=content&wt=python
...
But it is very slow when start gets too high.
I also tried using the cursorMark parameter with an initial query like this one:
http://ip:port/solr/docs/select?q=*:*&cursorMark=*&sort=id+asc&start=0&rows=1000000&fl=content&wt=python
which I believe tries to sort all the documents first, and that crashes the server. Sadly, I don't think it is possible to bypass the sort. What would be the proper way to do this?
Upvotes: 0
Views: 1709
Reputation: 1023
Okay, so I couldn't make it work with the cursor, even though that's probably me not knowing the tool well enough. If you are having the same problem as me, here are 3 tracks:

1. cursorMark, as described above: I couldn't get it to work.
2. _docid_, as suggested by @femtoRgon: I couldn't make it work either, but I didn't have a lot of time to allocate to it.
3. start values, but switching from wt=python to wt=csv, which is much faster and allows me to query by batches of 10M documents. This limits the number of queries, so the cost of using start instead of cursorMark is more or less amortized.

Good luck, and post your solutions if you find anything better.
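A minimal sketch of the batched start/wt=csv approach described above (the host, core name, and total document count are placeholders, not part of the original post):

```python
# Sketch of paging with the start parameter in large batches, as in track 3.
# The base URL and total count are hypothetical placeholders.

BASE = "http://ip:port/solr/docs/select"
BATCH = 10_000_000  # 10M documents per request, as described above

def batch_urls(total_docs, batch=BATCH, base=BASE):
    """Yield one select URL per batch, paging with the start parameter."""
    for start in range(0, total_docs, batch):
        yield (f"{base}?q=*:*&start={start}&rows={batch}"
               f"&fl=content&wt=csv")

urls = list(batch_urls(200_000_000))
print(len(urls))  # 20 requests cover 200M documents
```

Each URL would then be fetched and the CSV body parsed; with only 20 requests, the per-request cost of a large start offset is paid far less often than with small pages.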
Upvotes: 0
Reputation: 15789
This is a very well-known antipattern: deep paging with a large start value. You just need to use the cursorMark feature to go deep into a result set.
If cursorMark is not doable, then try the export handler.
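To make the cursorMark flow concrete, here is a minimal sketch of the paging loop (hedged: the fetch function is a placeholder that stands in for an HTTP call to /select with sort=id+asc, rows=..., cursorMark=...; real code would also use a much smaller rows value than the 1M in the question):

```python
# Sketch of Solr's cursorMark protocol: start with cursorMark=*, pass the
# returned nextCursorMark into the next request, and stop when the cursor
# stops changing.

def iterate_with_cursor(fetch, rows=1000):
    """Yield documents page by page using Solr's cursorMark protocol.

    fetch(cursor, rows) must return (docs, next_cursor), e.g. by querying
    /select?q=*:*&sort=id+asc&rows=...&cursorMark=...
    """
    cursor = "*"
    while True:
        docs, next_cursor = fetch(cursor, rows)
        for doc in docs:
            yield doc
        if next_cursor == cursor:  # unchanged cursor -> no more results
            break
        cursor = next_cursor

# Fake fetch simulating three pages, just to show the control flow:
pages = {"*": (["a", "b"], "c1"), "c1": (["c"], "c2"), "c2": ([], "c2")}
out = list(iterate_with_cursor(lambda cur, rows: pages[cur]))
print(out)  # ['a', 'b', 'c']
```

Note that cursorMark still requires a sort on a unique field (hence sort=id+asc), but unlike a large start offset it stays cheap on every page, which is why it is the recommended way to iterate a whole collection.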
Upvotes: 2