F.Raab

Reputation: 783

Better understand SQLAlchemy's `yield_per()` problems

To cite the SQLAlchemy documentation:

The Query.yield_per() method is not compatible with most eager loading schemes, including subqueryload and joinedload with collections.

Warning

Use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.

In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=’joined’ or ‘subquery’) since those collections will be cleared for a new load when encountered in a subsequent result batch. In the case of ‘subquery’ loading, the full result for all rows is fetched which generally defeats the purpose of yield_per().

Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available. The memory use of raw database rows is much less than that of an ORM-mapped object, but should still be taken into consideration when benchmarking.

I really have a problem understanding how yield_per() works and what exactly the problem is with using this method. Also, what is the right way to work around these problems and keep using this function for iterating over a huge number of rows?
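For concreteness, here is a minimal sketch of the kind of loop I mean; the User model and the in-memory SQLite URL are just placeholders:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# The intent: walk a huge table while only materializing 1000
# ORM objects at a time, instead of building every object up front.
for user in session.query(User).yield_per(1000):
    print(user.id, user.name)
```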

I'm interested in all constructive information you have, but here are some hint questions:

Upvotes: 29

Views: 18882

Answers (2)

vimi Liu

Reputation: 1

Be careful: sometimes yield_per will fail, though I'm not sure whether the root cause is yield_per or not. (You can test by setting the yield number to a magic value, such as 51 or 17.)

sqlalchemy.exc.ProgrammingError: (_mysql_exceptions.ProgrammingError) (2014, "Commands out of sync; you can't run this command now")

DB: MySQL 5.7.0

Upvotes: -4

Eevee

Reputation: 48564

Both of the problematic loading strategies raise exceptions if you try to use them with yield_per, so you don't really have to worry too much.

I believe the only problem with subqueryload is that batched loading of the second query isn't implemented (yet). Nothing would go wrong semantically, but if you're using yield_per, you probably have a really good reason not to want to load all the results at once. So SQLAlchemy politely refuses to go against your wishes.
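Here's a minimal sketch of that refusal, assuming a hypothetical one-to-many mapping (an A with a collection of Bs). In the versions I've seen this surfaces as an InvalidRequestError at execution time, but the exact exception message and the point where it's raised vary by version:

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, exc
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker, subqueryload

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    bs = relationship("B", backref="a")

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("a.id"))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

try:
    # SQLAlchemy refuses to combine subqueryload with yield_per
    # rather than silently fetching the whole subquery result.
    for a in session.query(A).options(subqueryload(A.bs)).yield_per(100):
        pass
except exc.InvalidRequestError as err:
    print(err)
```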

joinedload is a little more subtle. It's only forbidden in the case of a collection, where a primary row might have multiple associated rows. Say your query produces raw results like this, where A and B are primary keys from different tables:

 A | B 
---+---
 1 | 1 
 1 | 2 
 1 | 3 
 1 | 4 
 2 | 5 
 2 | 6 

Now you fetch these with yield_per(3). The problem is that SQLAlchemy can only limit how much it fetches by rows, but it has to return objects. Here, SQLAlchemy only sees the first three rows, so it creates an A object with key 1 and three B children: 1, 2, and 3.

When it loads the next batch, it wants to create a new A object with key 1... ah, but it already has one of those, so no need to create it again. The extra B, 4, is lost. (So, no, even reading joined collections with yield_per is unsafe — chunks of your data might go missing.)

You might say "well, just keep reading rows until you have a full object" — but what if that A has a hundred children? Or a million? SQLAlchemy can't reasonably guarantee that it can do what you asked and produce correct results, so it refuses to try.
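Continuing the hypothetical A/B mapping from the sketch above, here's the distinction in code; again, the exact behavior varies by version:

```python
from sqlalchemy.orm import joinedload

# Refused when executed: one A may span several rows, so a batch
# boundary can split its children across batches (B(4) in the
# example above would be lost).
bad = session.query(A).options(joinedload(A.bs)).yield_per(3)

# Fine: joinedload on the many-to-one side is safe, because each
# B row carries at most one parent A and is self-contained.
for b in session.query(B).options(joinedload(B.a)).yield_per(3):
    pass
```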


Remember that the DBAPI is designed so that any database can be used with the same API, even if that database doesn't support all the DBAPI features. Consider that the DBAPI is designed around cursors, but MySQL doesn't actually have cursors! The DBAPI adapters for MySQL have to fake them, instead.

So while cursor.fetchmany(100) will work, you can see from the MySQLdb source code that it doesn't fetch lazily from the server; it fetches everything into one big list, then returns a slice when you call fetchmany.
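Here's a sketch of the same thing at the raw DBAPI level, assuming the MySQLdb driver and a hypothetical users table; the connection parameters are placeholders:

```python
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="example")
cur = conn.cursor()

# With MySQLdb's default cursor, execute() buffers the entire result
# set into a client-side list right here.
cur.execute("SELECT id, name FROM users")

# fetchmany(100) then just returns 100-tuple slices of that list;
# the batching only limits how many tuples your loop touches per call,
# not how much the driver holds in memory.
while True:
    rows = cur.fetchmany(100)
    if not rows:
        break
    for row in rows:
        print(row)
```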

What psycopg2 supports is true streaming, where the results are remembered persistently on the server, and your Python process only sees a few of them at a time.
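A sketch of that streaming behavior at the Core level, assuming a hypothetical PostgreSQL DSN; stream_results is the same execution option that yield_per sets for you:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://me:secret@localhost/example")

with engine.connect() as conn:
    # With psycopg2, stream_results=True makes SQLAlchemy use a
    # server-side cursor: rows stay on the server and are pulled
    # over in chunks as you iterate, instead of being pre-buffered.
    result = conn.execution_options(stream_results=True).execute(
        text("SELECT id, name FROM users")
    )
    for row in result:
        print(row)
```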

You can still use yield_per with MySQLdb, or any other DBAPI; that's the whole point of the DBAPI's design. You'll have to pay the memory cost for all the raw rows hidden away in the DBAPI (which are tuples, fairly cheap), but you won't also have to pay for all of the ORM objects at the same time.

Upvotes: 17
