seaders

Reputation: 4096

With SQLAlchemy, how can I convert a row to a "real" Python object?

I've been using SQLAlchemy with Alembic to simplify my database access and to manage the schema changes I make to the tables. This worked out really well until I started noticing more and more issues with SQLAlchemy "expiring" fields at what looked, from my point of view, like random times.

A case in point is this snippet:

class HRDecimal(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[HRDecimal]
        """
        return list(HRDecimal.query.filter(*filters))


class Meta(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[Meta]
        """
        return list(Meta.query.filter(*filters))

Code:

ids = ['1', '2', '3']  # obviously fake list of ids

decs = HRDecimal.qfilter(
    HRDecimal.dec_id.in_(ids))
metas = Meta.qfilter(
    Meta.dec_id.in_(ids))
combined = []

for ident in ids:
    combined.append((
        ident,
        [dec for dec in decs if dec.dec_id == ident],
        [hm for hm in metas if hm.dec_id == ident]
    ))

For the above there wasn't a problem, but when I process a list of a few thousand ids, this starts taking a huge amount of time, and if it's done from a web request in Flask, the thread is often killed.

When I started poking around to find out why this was happening, the key area was

        [dec for dec in decs if dec.dec_id == ident],
        [hm for hm in metas if hm.dec_id == ident]

At some point while combining these (what I thought were) Python objects, the calls to dec.dec_id and hm.dec_id drop into the SQLAlchemy code. In the best case, we go into

def __get__(self, instance, owner):
    if instance is None:
        return self

    dict_ = instance_dict(instance)
    if self._supports_population and self.key in dict_:
        return dict_[self.key]
    else:
        return self.impl.get(instance_state(instance), dict_)

of InstrumentedAttribute in sqlalchemy/orm/attributes.py, which seems to be very slow. Even worse, I've observed times when fields had expired, and then we enter

def get(self, state, dict_, passive=PASSIVE_OFF):
    """Retrieve a value from the given object.
    If a callable is assembled on this object's attribute, and
    passive is False, the callable will be executed and the
    resulting value will be set as the new value for this attribute.
    """
    if self.key in dict_:
        return dict_[self.key]
    else:
        # if history present, don't load
        key = self.key
        if key not in state.committed_state or \
                state.committed_state[key] is NEVER_SET:
            if not passive & CALLABLES_OK:
                return PASSIVE_NO_RESULT

            if key in state.expired_attributes:
                value = state._load_expired(state, passive)

of AttributeImpl in the same file. The horrible issue here is that state._load_expired completely re-runs the SQL query. So in a situation like this, with a big list of idents, we end up running thousands of "small" SQL queries against the database, when I think we should only have been running the two "large" ones at the top.
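As a side note, a quick way to make the flood of "small" queries visible is to log every statement the engine emits. A minimal sketch using SQLAlchemy's event hooks (not part of my original code; the listener name is mine):

from sqlalchemy import event
from sqlalchemy.engine import Engine

@event.listens_for(Engine, "before_cursor_execute")
def log_statement(conn, cursor, statement, parameters, context, executemany):
    # Print each statement about to be executed; the repeated per-row
    # SELECTs triggered by expired attributes show up here.
    print(statement)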

Now, I've worked around the expiry issue by changing how I initialise the database for Flask, passing session_options. I changed

app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(app)

to

app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(
    app,
    session_options=dict(autoflush=False, autocommit=False, expire_on_commit=False))

This has definitely improved the situation where a row's fields seemed (from my observations) to expire at random, but the "normal" slowness of attribute access through SQLAlchemy is still an issue for what we're currently running.

Is there any way with SQLAlchemy to get a "real" Python object back from a query, instead of the proxied one it returns now, so that it isn't affected by any of this?

Upvotes: 4

Views: 3043

Answers (1)

Ilja Everilä

Reputation: 52939

Your randomness is probably related to either explicitly committing or rolling back at an inconvenient time, or to autocommit of some kind. In its default configuration a SQLAlchemy session expires all ORM-managed state when a transaction ends. This is usually a good thing, since when a transaction ends you have no idea what the current state of the DB is. It can be disabled, as you've done with expire_on_commit=False.
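If you're configuring a session yourself rather than through Flask-SQLAlchemy, the same switch goes on the sessionmaker. A minimal sketch (the engine URL is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://user:password@localhost/mydb')  # placeholder URL

# Keep attribute state loaded after commit instead of expiring it.
Session = sessionmaker(bind=engine, expire_on_commit=False)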

The ORM is also ill suited for extremely large bulk operations in general, as explained here. It is very well suited for handling complex object graphs and persisting them to a relational database with much less effort on your part, since it organizes the required inserts etc. for you. An important part of that is tracking changes to instance attributes. SQLAlchemy Core is better suited for bulk operations.
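For example, a bulk read can bypass the ORM instrumentation entirely by selecting from the Core table. A rough sketch against your mapped table (assuming the ids list from your question and SQLAlchemy 1.x select() syntax):

from sqlalchemy import select

# Query the underlying Core table; rows come back as lightweight
# tuples with no change tracking or expiry machinery attached.
stmt = select([HRDecimal.__table__]).where(
    HRDecimal.__table__.c.dec_id.in_(ids))
rows = db.session.execute(stmt).fetchall()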

It looks like you're performing 2 queries that produce a potentially large number of results and then doing a manual "group by" on the data, but in a rather unperformant way: for each id you have, you scan the entire list of results, which is O(nm), where n is the number of ids and m the number of results. Instead you should first group the results into lists of objects by id, and then perform the "join". On some other database systems you could handle the grouping in SQL directly, but alas MySQL has no notion of arrays, other than JSON.

A possibly more performant version of your grouping could be for example:

from itertools import groupby
from operator import attrgetter

ids = ['1', '2', '3']  # obviously fake list of ids

# Order the results by `dec_id` for Python itertools.groupby. Cannot
# use your `qfilter()` method as it produces lists, not queries.
decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    order_by(HRDecimal.dec_id).\
    all()

metas = Meta.query.\
    filter(Meta.dec_id.in_(ids)).\
    order_by(Meta.dec_id).\
    all()

key = attrgetter('dec_id')
decs_lookup = {dec_id: list(g) for dec_id, g in groupby(decs, key)}
metas_lookup = {dec_id: list(g) for dec_id, g in groupby(metas, key)}

combined = [(ident,
             decs_lookup.get(ident, []),
             metas_lookup.get(ident, []))
            for ident in ids]

Note that since in this version we iterate over the queries only once, all() is not strictly necessary, but it should not hurt much either. The grouping could also be done without sorting in SQL, using a collections.defaultdict(list):

from collections import defaultdict

decs = HRDecimal.query.filter(HRDecimal.dec_id.in_(ids)).all()
metas = Meta.query.filter(Meta.dec_id.in_(ids)).all()

decs_lookup = defaultdict(list)
metas_lookup = defaultdict(list)

for d in decs:
    decs_lookup[d.dec_id].append(d)

for m in metas:
    metas_lookup[m.dec_id].append(m)

combined = [(ident, decs_lookup[ident], metas_lookup[ident])
            for ident in ids]

And finally, to answer your question: you can fetch "real" Python objects by querying for the Core table instead of the ORM entity:

decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    with_entities(HRDecimal.__table__).\
    all()

which will result in a list of namedtuple-like objects that can easily be converted to dicts with _asdict().
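For example (a small usage sketch based on the query above):

# Each row behaves like a namedtuple keyed by column name and carries
# none of the ORM's attribute instrumentation.
rows = [row._asdict() for row in decs]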

Upvotes: 5
