Lars Noschinski

Reputation: 3667

Fast loading of a Solr streaming response (JSON) into Polars

I want to load large responses from the Solr streaming API into Polars (Python) efficiently. The Solr streaming API returns JSON of the following form:

{
  "result-set":{
    "docs":[{
       "col1":"value",
       "col2":"value"}
    ,{
       "col1":"value",
       "col2":"value"}
    ...
    ,{
       "EOF":true,
       "RESPONSE_TIME":12345}]}}

That is, I need every element of result-set.docs except for the last one, which marks the end of the response.

For now, my fastest solution is to first convert this to ndjson using jstream and GNU head, and then use pl.read_ndjson:

cat result.json | jstream -d 3 | head -n -1 > result.ndjson

This clocks in at around 8s for a 770MiB file, which is perfectly fine for me. If I manually change the JSON to just have a top-level list, I can load it even faster using pl.read_json(result_manipulated).head(-1), clocking in at around 3s, at least if I specify the schema manually so the last element does not produce any schema errors.
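For reference, the Python side of these two variants looks roughly like this (a sketch; the schema only covers the two example columns above, and the file names are placeholders):

import polars as pl

# Schema covering only the example columns above; the real response has more.
schema = {"col1": pl.Utf8, "col2": pl.Utf8}

# Variant 1: load the ndjson produced by the jstream/head pipeline (~8s).
df = pl.read_ndjson("result.ndjson", schema=schema)

# Variant 2: load the hand-edited file whose top level is a plain list of
# docs, then drop the trailing EOF marker row with head(-1) (~3s).
df = pl.read_json("result_manipulated.json", schema=schema).head(-1)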

So, I wonder: is there any fast way to import this file without leaving Python?

Upvotes: 0

Views: 62

Answers (1)

JanHoy

Reputation: 316

This is a classic stream/buffer reading issue. Instead of bulk-processing the entire streaming response from Solr, the intention is that the client reads it chunk by chunk and makes sense of it as it goes.

I have not tried this myself, but there are streaming JSON parsers in the Python ecosystem, such as json-stream (https://pypi.org/project/json-stream/), which seems to fit the bill at a glance. I believe you will be able to configure it so that your code consumes one doc at a time while still reading from your streaming request.
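A minimal, untested sketch of that idea, assuming json-stream's documented load() and to_standard_types() helpers and the example file/column names from the question:

import json_stream
import polars as pl

def load_solr_stream(path: str) -> pl.DataFrame:
    rows = []
    with open(path, "r") as f:
        # json_stream.load parses lazily, so each doc is materialized one at
        # a time instead of buffering the whole response in memory.
        data = json_stream.load(f)
        for doc in data["result-set"]["docs"]:
            row = json_stream.to_standard_types(doc)  # transient node -> plain dict
            if "EOF" in row:
                break  # the final sentinel doc marks the end of the result set
            rows.append(row)
    return pl.from_dicts(rows)

df = load_solr_stream("result.json")

Materializing every doc as a Python dict keeps memory flat but adds per-row overhead, so this may well not beat the ndjson pipeline on raw speed; it does, however, stay entirely in Python.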

Good luck

-- Jan Høydahl - Apache Solr committer

Upvotes: 0
