asdasd31

Reputation: 125

Is it possible to cache a table in SQLAlchemy so a specific use case doesn't query the database?

I have a use case where I am running a Python script that constantly updates a specific column on a table in a database. The script looks like the following:

while True:
    events = load_events()
    for event in events:
        team_1 = event.team_1
        team_2 = event.team_2
        dd = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
        begin = dd - timedelta(hours=4)
        end = dd + timedelta(hours=4)

        db_event = session.query(SportEvent).filter(and_(
            SportEvent.home_team.contains(event.team_1),                                      
            SportEvent.away_team.contains(event.team_2),
            SportEvent.game_time.between(begin, end))).first()
        if db_event.e_id != event.event_id:
            setattr(db_event, 'e_id', event.event_id)
    session.commit()

My problem is that each loop iteration takes far longer than is acceptable because I am updating 100k events. I want to implement some sort of snapshot or cache at the start of each loop so that each query against SportEvent doesn't actually hit the database, but instead just checks against the cache. (I understand the risks of this, and they are acceptable for my use case given how this table is updated/accessed as a whole.) How can I implement something like this?

class SportEvent(Base):
    __tablename__ = 'sport_events'

    id = Column(String, primary_key=True)
    sport_id = Column(String)
    league_id = Column(String)
    away_team = Column(String)
    home_team = Column(String)
    game_time = Column(DateTime)
    dk_id = Column(String)
    fd_id = Column(String)
    cs_id = Column(String)
    bm_id = Column(String)

    def __init__(self, home_team=None,
                 away_team=None,
                 game_time=None, sport_id=None, league_id=None):
        self.id = str(uuid.uuid4())
        self.home_team = home_team
        self.away_team = away_team
        self.game_time = game_time
        self.sport_id = sport_id
        self.league_id = league_id
        self.dk_id = ""
        self.fd_id = ""
        self.cs_id = ""
        self.bm_id = ""

Upvotes: 2

Views: 185

Answers (1)

flakes

Reputation: 23624

What you're experiencing here is the cost per round trip from your service to your database. Depending on how many records need to be returned, there are some options.

Note, without a deeper understanding of the data provided, these queries might not be optimal.

The most obvious one is doing bulk queries. Say the update times are all clustered around the same period. We could do something like this:

# Find the earliest and latest game times across all incoming events.
begin = None
end = None
for event in events:
    gametime = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
    if not begin or gametime < begin:
        begin = gametime
    if not end or gametime > end:
        end = gametime

# Widen the window by the same four-hour margin the original query used.
begin = begin - timedelta(hours=4)
end = end + timedelta(hours=4)

# One bulk query covering the whole window, instead of one query per event.
records = session.query(SportEvent).filter(
    SportEvent.game_time.between(begin, end)
).all()

Now, when looping, you can consult the data you already queried instead of hitting the database again:

for event in events:
    team_1 = event.team_1
    team_2 = event.team_2
    dd = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
    begin = dd - timedelta(hours=4)
    end = dd + timedelta(hours=4)

    # Look up the matching record in the cached result set rather than the DB.
    db_event = next(
        (
            record
            for record in records
            if (
                team_1 in record.home_team
                and team_2 in record.away_team
                and begin < record.game_time < end
            )
        ),
        None,
    )
    if db_event and db_event.e_id != event.event_id:
        db_event.e_id = event.event_id

session.commit()

Again, a better answer requires knowing more about the event data and how the queries can be shaped to limit the results returned. Say only a limited set of teams ever appears in the events: you could do one bulk query restricted to those teams and then loop over each team, as in the sketch below.
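For instance, a minimal sketch along those lines, assuming the incoming events expose the same team_1/team_2 attributes as in the question and that the team sets are non-empty (or_ comes from sqlalchemy, just like the and_ already used above):

# Collect the distinct team names that actually appear in this batch of events.
home_teams = {event.team_1 for event in events}
away_teams = {event.team_2 for event in events}

# One bulk query restricted to those teams, instead of one query per event.
records = session.query(SportEvent).filter(
    or_(*[SportEvent.home_team.contains(team) for team in home_teams]),
    or_(*[SportEvent.away_team.contains(team) for team in away_teams]),
).all()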

Basically, you need to look at the commonalities in your data and make as few calls to the database as possible.

Also note that most databases have a maximum limit on the number of parameters in a query; in the case of PostgreSQL that's 65,535, which is below the 100k records you're working with, so a single query can't cover them all at once.
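A rough sketch of working around that is to chunk the parameter list into batches below the limit. The batch size and the event_ids list here are purely illustrative:

def chunked(values, size=60_000):
    # Yield slices of at most `size` values so each query stays under the limit.
    for i in range(0, len(values), size):
        yield values[i:i + size]

records = []
for batch in chunked(event_ids):  # event_ids: hypothetical list of SportEvent ids
    records.extend(
        session.query(SportEvent).filter(SportEvent.id.in_(batch)).all()
    )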

Upvotes: 1
