Reputation: 125
I have a use case where I am running a Python script that constantly updates a specific column on a table in a database. The script looks like the following:
while True:
    events = load_events()
    for event in events:
        team_1 = event.team_1
        team_2 = event.team_2
        dd = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
        begin = dd - timedelta(hours=4)
        end = dd + timedelta(hours=4)
        db_event = session.query(SportEvent).filter(and_(
            SportEvent.home_team.contains(event.team_1),
            SportEvent.away_team.contains(event.team_2),
            SportEvent.game_time.between(begin, end))).first()
        if db_event.e_id != event.event_id:
            setattr(db_event, 'e_id', event.event_id)
            session.commit()
My problem is that each loop takes far longer than is acceptable because I am updating 100k events. I want to implement some sort of snapshot or cache at the start of each loop so that each lookup of a SportEvent doesn't actually produce a query against the database, but rather just checks the cache. (I understand the risks of this and they are acceptable for my use case, given how this table is updated/accessed as a whole.) How can I implement something like this?
class SportEvent(Base):
    __tablename__ = 'sport_events'

    id = Column(String, primary_key=True)
    sport_id = Column(String)
    league_id = Column(String)
    away_team = Column(String)
    home_team = Column(String)
    game_time = Column(DateTime)
    dk_id = Column(String)
    fd_id = Column(String)
    cs_id = Column(String)
    bm_id = Column(String)

    def __init__(self, home_team=None, away_team=None,
                 game_time=None, sport_id=None, league_id=None):
        self.id = uuid.uuid4()
        self.home_team = home_team
        self.away_team = away_team
        self.game_time = game_time
        self.sport_id = sport_id
        self.league_id = league_id
        self.dk_id = ""
        self.fd_id = ""
        self.cs_id = ""
        self.bm_id = ""
Upvotes: 2
Views: 185
Reputation: 23624
So what you're experiencing here is the cost per round trip from your service to your database. Depending on how many records need to be returned, there are some options.
Note that without a deeper understanding of the data involved, these queries might not be optimal.
The most obvious option is a bulk query. Say the game times of the events you're updating all fall around the same time. We could do something like this:
# Find the earliest and latest game times so one query covers every event.
begin = None
end = None
for event in events:
    gametime = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
    if not begin or gametime < begin:
        begin = gametime
    if not end or gametime > end:
        end = gametime

begin = begin - timedelta(hours=4)
end = end + timedelta(hours=4)

# A single round trip fetches everything in that window.
records = session.query(SportEvent).filter(
    SportEvent.game_time.between(begin, end)
).all()
Now, when looping, you can consult the data you already queried instead of hitting the database for each event.
for event in events:
    team_1 = event.team_1
    team_2 = event.team_2
    dd = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
    begin = dd - timedelta(hours=4)
    end = dd + timedelta(hours=4)

    # Look the event up in the records fetched above instead of querying again.
    db_event = next(
        (
            record
            for record in records
            if (
                team_1 in record.home_team
                and team_2 in record.away_team
                and begin < record.game_time < end
            )
        ),
        None
    )
    if db_event and db_event.e_id != event.event_id:
        db_event.e_id = event.event_id
        session.commit()
Again, a better answer requires knowing more about the event data and how the query can be narrowed to limit the results returned. Say only specific teams are playing: you could do one bulk query and then a loop for each team.
Basically, you need to look at the commonalities in your data and try to make as few calls to the database as possible.
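As one rough sketch of that idea (not a drop-in fix): if the feed's team names match the stored home_team/away_team values exactly (the contains() filters above suggest they might not, so treat this as an assumption), you can index the bulk-queried records by team pair, so each lookup is a dictionary hit instead of a scan over the whole list:

# Sketch: index the bulk-queried records by team pair for cheap lookups.
# Assumes exact name matches; adapt the key (e.g. normalised names)
# if you currently rely on substring matching.
records_by_teams = {}
for record in records:
    records_by_teams.setdefault((record.home_team, record.away_team), []).append(record)

for event in events:
    dd = datetime.strptime(event.gametime, "%Y-%m-%dT%H:%M:%S.000Z")
    begin = dd - timedelta(hours=4)
    end = dd + timedelta(hours=4)
    candidates = records_by_teams.get((event.team_1, event.team_2), [])
    db_event = next(
        (r for r in candidates if begin < r.game_time < end),
        None
    )
    if db_event and db_event.e_id != event.event_id:
        db_event.e_id = event.event_id
        session.commit()

Committing once after the loop instead of once per matching row would also remove a large number of round trips, if it's acceptable for the updates to become visible all at once.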
Also note that most databases have a maximum number of parameters per query; in the case of PostgreSQL that's 65,535, which is below the roughly 100k parameters you'd need to pass every event in a single query.
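If you do end up needing a query that takes one parameter per event (for example an IN list of ids), the usual workaround is to split the list into chunks that stay under that limit. A minimal sketch, assuming a Python list called event_ids and matching against the id column purely for illustration:

CHUNK_SIZE = 10_000  # comfortably below PostgreSQL's bind-parameter limit

records = []
for i in range(0, len(event_ids), CHUNK_SIZE):
    chunk = event_ids[i:i + CHUNK_SIZE]
    # One round trip per chunk instead of one per event.
    records.extend(
        session.query(SportEvent)
        .filter(SportEvent.id.in_(chunk))
        .all()
    )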
Upvotes: 1