Reputation: 31166
I'm storing pandas dataframes in Redis, serialising them with pyarrow. This is working well. I want to make this data available to Jupyter notebooks via Flask. It works fine on localhost but fails when running on AWS EB.
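For context, a minimal sketch of the caching side, assuming pyarrow's serialize_pandas/deserialize_pandas helpers (current in the 0.14-0.16 releases involved here) and a plain redis-py client; the key names are illustrative:

import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis()  # connection details are illustrative

def cache_df(key, df):
    # serialize_pandas returns a pyarrow Buffer; store its raw bytes
    r.set(key, pa.serialize_pandas(df).to_pybytes())
    r.set(f"{key}.type", "pandas.DataFrame")  # type hint returned as a header

def fetch_df(key):
    # deserialize_pandas accepts bytes/buffer-like input
    return pa.deserialize_pandas(r.get(key))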
Flask code
from io import BytesIO
from flask import Response

@app.route('/cacheget/<path:key>', methods=['GET'])
def cacheget(key):
    c = mycache()  # application-specific Redis wrapper
    # stream the raw pyarrow bytes straight through, untouched
    resp = Response(BytesIO(c.redis().get(key)), mimetype="text/plain",
                    direct_passthrough=True)
    resp.headers["key"] = key
    resp.headers["type"] = c.redis().get(f"{key}.type")
    return resp
Jupyter tests against Flask running on localhost and on AWS EB
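The test itself isn't shown above; a sketch of what it would look like from the notebook, using requests to pull the bytes back and pyarrow to rebuild the frame (URL and key are placeholders):

import pyarrow as pa
import requests

url = "http://localhost:5000/cacheget/mykey"  # swap the host for the EB endpoint
resp = requests.get(url)
print(resp.headers.get("key"), resp.headers.get("type"), len(resp.content))

# rebuild the dataframe from the raw pyarrow bytes
df = pa.deserialize_pandas(resp.content)
df.head()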
I suspect the bytes content is somehow incomplete by the time pyarrow deserialises it, but I cannot find any evidence of this, nor any related posts. I am considering switching from pyarrow-serialised data on the wire to JSON, i.e. in the Flask route, converting the serialised bytes back to a pandas dataframe and then to JSON. That, however, will be at least 10x bigger on the wire.
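The JSON fallback under consideration would look roughly like this in the route (a sketch; deserialize_pandas stands in for whatever mycache uses to round-trip the frame):

@app.route('/cacheget_json/<path:key>', methods=['GET'])
def cacheget_json(key):
    c = mycache()
    # rebuild the dataframe from the cached pyarrow bytes, then re-emit as JSON text
    df = pa.deserialize_pandas(c.redis().get(key))
    return Response(df.to_json(orient="split"), mimetype="application/json")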
Are my HTTP headers correctly set for this? Are there any known issues with sending bytes data like this over the wire?
Upvotes: 0
Views: 771
Reputation: 31166
The issue was incompatible versions of pyarrow: the AWS EB instance was running 0.14.1 and the Jupyter client 0.16.0. I downgraded the client to 0.14.1 and reset the Redis caches on localhost so that the pandas dataframes are serialised in the local Redis cache using pyarrow 0.14.1. base64 encoding is not necessary and increases the payload by at least 20%. I arrived at this conclusion by calling sys.getsizeof() on the payload in Flask, putting the result in a header, and then doing the same on the bytes data read in Jupyter.
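A quick way to catch this class of problem, sketched below: compare the pyarrow version and the payload size on both ends (the "size" header name is my own, not part of the original code):

import sys
import pyarrow as pa
import requests

# both ends must agree on this before deserialising anything
print("client pyarrow:", pa.__version__)

# on the Flask side, the route would add e.g.
#   resp.headers["size"] = str(sys.getsizeof(data))
resp = requests.get("http://localhost:5000/cacheget/mykey")  # placeholder URL
print("server size:", resp.headers.get("size"))
print("client size:", sys.getsizeof(resp.content))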
Upvotes: 1