Reputation: 188134
Solr has a great streaming feature that allows fetching a large number of documents quickly via cursors (e.g. solrdump lets you use this feature from the command line).
For this kind of cursor query, it is possible to set the wt parameter as well to control the serialization format, the default being xml as of Solr 5.5.
$ curl -v "http://solr/select?cursorMark=*&fl=...&q=...&sort=id+asc&wt=json"
-------
However, for streaming queries to work, one must parse the nextCursorMark out of each response. With JSON and XML this is just another field, but with wt set to csv this information cannot be accommodated (at least not in the payload):
{
...
},
"nextCursorMark": "AoE4YWktNDgtUTAxRlgxODFNakU1TVRFeU1R"
}
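The cursor loop with a JSON (or XML) response writer looks roughly like this: issue the query with cursorMark=*, read nextCursorMark from the response, and repeat with that mark until it stops changing. A minimal sketch in Python; the helper only parses the mark and builds the next URL (actually issuing the HTTP requests is omitted), and the endpoint URL is a placeholder:

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint; substitute your collection's select handler.
SOLR_SELECT = "http://solr/select"

def next_request(params, response_body):
    """Extract nextCursorMark from a JSON response and build the URL
    for the following page, or return None once the cursor stops
    advancing (same mark twice means all documents were fetched)."""
    mark = json.loads(response_body)["nextCursorMark"]
    if mark == params.get("cursorMark"):
        return None
    return SOLR_SELECT + "?" + urlencode(dict(params, cursorMark=mark))

# First page uses cursorMark=*; the response carries the next mark.
params = {"q": "*:*", "sort": "id asc", "wt": "json", "cursorMark": "*"}
body = '{"nextCursorMark": "AoE4YWktNDgtUTAxRlgxODFNakU1TVRFeU1R"}'
print(next_request(params, body))
```

With wt=csv there is no such field in the body to feed into this loop, which is exactly the problem.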
My first thought was that nextCursorMark would probably be sent in an HTTP header, but it seems it is not.
$ curl -v http://ex.index/solr/select?...wt=csv
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Last-Modified: Fri, 05 Jan 2018 00:04:25 GMT
< ETag: "ZTM2MDQ4MDAwMDAwMDAwMFNvbHI="
< Content-Type: text/plain; charset=UTF-8
< Transfer-Encoding: chunked
<
----8<---- body ----8<----
Is it possible to use this kind of query with the CSV format? I am curious, because I would expect slight performance wins, if both sender and receiver can just use CSV instead of JSON or XML.
Update:
It seems that some information (status, query time) is already put into the response headers, in SolrCore.java. Maybe this is only used with ADMIN requests; it also appears in the V2 API - see the V2 docs.
Upvotes: 1
Views: 689
Reputation: 15789
As you can see in this thread, that feature is not supported. If you have some kind of unique id you can sort on, you can just roll your own cursorMark as explained there (still using wt=csv). I did this for a mass migration of close to a billion docs, and it worked perfectly.
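The roll-your-own approach described there boils down to: sort on the unique id, fetch a page with wt=csv, read the last id from the page, and use it as an exclusive lower bound in a range filter for the next query. A minimal sketch of the page-advance step; the field name `id` and the quoting of the range bound are assumptions, adjust to your schema:

```python
import csv
import io

def next_filter(csv_page, id_field="id"):
    """Given one CSV page sorted ascending by a unique id field,
    return the Solr range filter for the next page, or None when
    the page is empty (i.e. no documents remain).
    Assumes the id field appears in the CSV header row."""
    rows = list(csv.DictReader(io.StringIO(csv_page)))
    if not rows:
        return None
    last_id = rows[-1][id_field]
    # '{' makes the lower bound exclusive, so the last document of
    # this page is not fetched again on the next page.
    return '%s:{"%s" TO *]' % (id_field, last_id)

page = "id,title\ndoc-1,first\ndoc-2,second\n"
print(next_filter(page))
```

Each iteration then requests `q=<filter>&sort=id asc&rows=N&wt=csv` until an empty page comes back.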
An important caveat: the index must not be written to while you iterate, or if it is:
Upvotes: 2