Lars Noschinski

Reputation: 3667

Fast loading of a Solr streaming response (JSON) into Polars

I want to load large responses from the Solr streaming API into Polars (Python) efficiently. The Solr streaming API returns JSON of the following form:

{
  "result-set":{
    "docs":[{
       "col1":"value",
       "col2":"value"}
    ,{
       "col1":"value",
       "col2":"value"}
    ...
    ,{
       "EOF":true,
       "RESPONSE_TIME":12345}]}}

That is, I need every element of result-set.docs except for the last one, which marks the end of the response.

For now, my fastest solution is to first convert this to ndjson using jstream and GNU head, and then use pl.read_ndjson:

cat result.json | jstream -d 3 | head -n -1 > result.ndjson

This clocks in at around 8s for a 770MiB file, which is perfectly fine for me. If I manually change the JSON to just have a top-level list, I can load it even faster using pl.read_json(result_manipulated).head(-1), clocking in at around 3s, at least if I specify the schema manually so the last element does not produce any schema errors.
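For reference, the Python side of these two variants looks roughly like this (a sketch; the schema only covers the two example columns above, and the file names are placeholders):

import polars as pl

# Schema covering only the example columns above; the real response has more.
schema = {"col1": pl.Utf8, "col2": pl.Utf8}

# Variant 1: load the ndjson produced by the jstream/head pipeline (~8s).
df = pl.read_ndjson("result.ndjson", schema=schema)

# Variant 2: load the hand-edited file whose top level is a plain list of
# docs, then drop the trailing EOF marker row with head(-1) (~3s).
df = pl.read_json("result_manipulated.json", schema=schema).head(-1)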

So, I wonder: is there any fast way to import this file without leaving Python?

Upvotes: 0

Views: 62

Answers (1)

JanHoy

Reputation: 316

This is a classic stream/buffer reading issue. Instead of bulk-processing the entire streaming response from Solr, the intention is that the client reads it chunk by chunk and makes sense of it as it goes.

I have not tried this myself, but there are streaming JSON parsers in the Python ecosystem, such as json-stream (https://pypi.org/project/json-stream/), which seems to fit the bill at a glance. I believe you will be able to configure it so that your code consumes one doc at a time while still reading from your streaming request.
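A minimal, untested sketch of that idea, assuming json-stream's documented load() and to_standard_types() helpers and the example file/column names from the question:

import json_stream
import polars as pl

def load_solr_stream(path: str) -> pl.DataFrame:
    rows = []
    with open(path, "r") as f:
        # json_stream.load parses lazily, so each doc is materialized one at
        # a time instead of buffering the whole response in memory.
        data = json_stream.load(f)
        for doc in data["result-set"]["docs"]:
            row = json_stream.to_standard_types(doc)  # transient node -> plain dict
            if "EOF" in row:
                break  # the final sentinel doc marks the end of the result set
            rows.append(row)
    return pl.from_dicts(rows)

df = load_solr_stream("result.json")

Materializing every doc as a Python dict keeps memory flat but adds per-row overhead, so this may well not beat the ndjson pipeline on raw speed; it does, however, stay entirely in Python.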

Good luck

-- Jan Høydahl - Apache Solr committer

Upvotes: 0
