Reputation: 49198
I'm trying to optimize some Python code. The profiler tells me that SQLAlchemy's _get_col() is what's killing performance. The code looks something like this:
lots_of_rows = get_lots_of_rows()
for row in lots_of_rows:
    if row.x == row.y:
        print row.z
I was about to go through the code and make it more like this...
lots_of_rows = get_lots_of_rows()
for row in lots_of_rows:
    if row[0] == row[1]:
        print row[2]
...but I've found some documentation that seems to indicate that when accessing row objects like arrays, you're actually still pulling dictionary keys. In other words, the row object looks like this:
'x': (x object)
'0': (x object)
'y': (y object)
'1': (y object)
'z': (z object)
'2': (z object)
If that's the case, I doubt I'll see any performance improvement from accessing columns by number rather than name. Is there any way to get SA to return results as a list of tuples, or a list of lists, rather than a list of dictionaries? Alternately, can anyone suggest any other optimizations?
Upvotes: 3
Views: 2981
Reputation: 54882
SQLAlchemy proxies all access to the underlying database cursor in order to map named keys to positions in the row tuple and perform any necessary type conversions. The underlying implementation is quite heavily optimized, caching almost everything. Looking over the disassembly, the only ways to optimize further seem to be: throw out extensibility and shave off a couple of attribute lookups, resort to dynamic code generation for a small gain, or, for a larger gain, implement the corresponding ResultProxy and RowProxy classes in C.
Some quick profiling shows that the overhead is around 5 µs per lookup on my laptop. That will be significant if only trivial processing is done with the data. In those kinds of cases it might be reasonable to drop down to the DBAPI level. This doesn't mean you have to lose the query-building functionality of SQLAlchemy: just execute the statement as you usually would and get the DBAPI cursor from the ResultProxy by accessing result.cursor.cursor (result.cursor is an SQLAlchemy CursorFairy object). Then you can use the regular DBAPI fetchall(), fetchone() and fetchmany() methods.
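As a rough sketch of the DBAPI-level approach, here the stdlib sqlite3 module stands in for whatever driver sits behind the ResultProxy, and the table and column names are invented for the example:

```python
import sqlite3

# Plain DBAPI usage: rows come back as ordinary tuples, with no
# per-column proxying or type conversion on each access.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER, z TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 1, "a"), (1, 2, "b"), (3, 3, "c")])

cursor = conn.execute("SELECT x, y, z FROM t")
rows = cursor.fetchall()  # a list of plain tuples
for x, y, z in rows:
    if x == y:
        print(z)
```

With SQLAlchemy you would build and execute the statement as usual and fetch from the raw cursor instead; the loop body stays the same.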
But if you really are doing only trivial processing, it might be better to do it, or at least the filtering part, on the database server. You'll probably lose some database portability, but that may not be an issue.
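A minimal sketch of pushing the filter onto the database server, again using stdlib sqlite3 with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER, z TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 1, "a"), (1, 2, "b"), (3, 3, "c")])

# The x = y comparison runs inside the database, so Python never
# sees the rows that would have been filtered out anyway.
matches = conn.execute("SELECT z FROM t WHERE x = y").fetchall()
print([z for (z,) in matches])
```

The fewer rows cross the DBAPI boundary, the less the per-lookup overhead matters.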
Upvotes: 1
Reputation:
You should post your profiler results, as well as the stack traces around the _get_col call, so we know which _get_col is being called (and whether _get_col really is the bottleneck).
I looked at the SQLAlchemy source; it looks like it may be calling lookup_key (in engine/base.py) each time, and that this caches the column value locally, I guess lazily (via PopulateDict).
You can try bypassing that by using row.__props directly (not recommended, since it's private), or maybe row.cursor, but it looks like you would gain the most by bypassing SQLAlchemy entirely (except for the SQL generation) and working directly with a cursor.
-- J
Upvotes: 0
Reputation: 38189
I think row.items() is what you're looking for. It returns a list of (key, value) tuples for the row.
Upvotes: 1
Reputation: 41633
Forgive the obvious answer, but why isn't row.x == row.y part of your query? For example:
mytable.select().where(mytable.c.x == mytable.c.y)
That should give you a huge performance boost. See the rest of the documentation.
Upvotes: 2