AbreQueVoy

Reputation: 2316

How to effectively synchronize freshly fetched data with data stored in database?

Let's start with initialization of the database:

import sqlite3


entries = [
    {"name": "Persuasive", "location": "Bolivia"},
    {"name": "Crazy", "location": "Guyana"},
    {"name": "Humble", "location": "Mexico"},
    {"name": "Lucky", "location": "Uruguay"},
    {"name": "Jolly", "location": "Alaska"},
    {"name": "Mute", "location": "Uruguay"},
    {"name": "Happy", "location": "Chile"}
]

conn = sqlite3.connect('entries.db')
conn.execute('''DROP TABLE IF EXISTS ENTRIES''')
conn.execute('''CREATE TABLE ENTRIES
(ID INT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
LOCATION TEXT NOT NULL,
ACTIVE NUMERIC NULL);''')

# assign sequential IDs so the :id placeholder can be bound
for i, entry in enumerate(entries, start=1):
    entry["id"] = i

conn.executemany("""INSERT INTO ENTRIES (ID, NAME, LOCATION) VALUES (:id, :name, :location)""", entries)
conn.commit()

That was an initial run, just to populate the database with some data.

Then, every time the application runs, new data gets fetched from somewhere:

findings = [
    {"name": "Brave", "location": "Bolivia"}, # new
    {"name": "Crazy", "location": "Guyana"},
    {"name": "Humble", "location": "Mexico"},
    {"name": "Shy", "location": "Suriname"}, # new
    {"name": "Cautious", "location": "Brazil"}, # new
    {"name": "Mute", "location": "Uruguay"},
    {"name": "Happy", "location": "Chile"}
]

In this case, the fetched list contains 3 new items. I expect all items already in the database to remain there, and the 3 new items to be appended to the db. Every item present in the list above should get its active flag set to True; the remaining database rows should get the flag set to False. Let's prepare a dump from the database:

conn = sqlite3.connect('entries.db')
cursor = conn.execute("SELECT * FROM ENTRIES ORDER BY ID")

db_entries = []
for row in cursor:
    entry = {"id": row[0], "name": row[1], "location": row[2], "active": row[3]}
    db_entries.append(entry)
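As a side note, sqlite3 can build these dicts for you via a row factory instead of indexing tuples by position. A minimal sketch (using an in-memory database with one sample row here, where the post uses entries.db):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # 'entries.db' in the post; in-memory for illustration
conn.execute("CREATE TABLE ENTRIES (ID INT PRIMARY KEY NOT NULL, NAME TEXT, LOCATION TEXT, ACTIVE NUMERIC)")
conn.execute("INSERT INTO ENTRIES (ID, NAME, LOCATION) VALUES (1, 'Crazy', 'Guyana')")

conn.row_factory = sqlite3.Row  # subsequent cursors yield rows addressable by column name
cursor = conn.execute("SELECT * FROM ENTRIES ORDER BY ID")
db_entries = [{k.lower(): row[k] for k in row.keys()} for row in cursor]

print(db_entries)
# -> [{'id': 1, 'name': 'Crazy', 'location': 'Guyana', 'active': None}]
```

This way the column order in the SELECT no longer matters, and adding a column doesn't break the dict-building loop.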

OK, now we can compare what's in the new findings with what was already in the database:

import random


for f in findings:
    n = next((d for d in db_entries if d["name"] == f["name"] and d["location"] == f["location"]), None)
    if n is None:
        id = int(random.random() * 10)
        conn.execute('''INSERT INTO ENTRIES(ID, NAME, LOCATION, ACTIVE) VALUES (?, ?, ?, ?)''',
                     (id, f["name"], f["location"], 1))
        conn.commit()
    else:
        n.update({"act": "yes"})  # n already is the matching db_entries dict, so flag it directly
        conn.execute("UPDATE ENTRIES set ACTIVE = 1 where ID = ?", (n["id"],))
        conn.commit()

(I know you're probably upset with the random ID generator, but it's for prototyping purpose only)

As you saw, the db_entries items that are common with the findings were tagged with a flag ({"act": "yes"}). Now they get processed, and besides that, the items that are no longer active get tagged with a different flag and then queried for deactivation:

for d in db_entries:
    if "act" in d:
        conn.execute("UPDATE ENTRIES set ACTIVE = 1 where ID = ?", (d["id"],))
        conn.commit()
    else:
        if d["active"] == 1:
            d.update({"deact": "yes"})

for d in db_entries:
    if "deact" in d:
        conn.execute("UPDATE ENTRIES set ACTIVE = 0 where ID = ?", (d["id"],))
        conn.commit()

conn.close()

And this is it: items fetched on the fly were compared with those in the database and synchronized.

I have a feeling that this approach saves some data transfer between the application and the database, as it only updates items that require updating; on the other hand, it feels like the whole process could be rebuilt and made more effective.
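For what it's worth, here is one way I imagine the per-row loop could be collapsed: key both sides by (name, location), diff them as sets, and batch the writes with executemany. A rough self-contained sketch (small in-memory sample data standing in for the real tables above):

```python
import sqlite3

# Minimal setup mirroring the post (in-memory for illustration)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE ENTRIES (ID INT PRIMARY KEY NOT NULL, NAME TEXT, LOCATION TEXT, ACTIVE NUMERIC)")
db_entries = [
    {"id": 1, "name": "Crazy", "location": "Guyana", "active": 1},
    {"id": 2, "name": "Lucky", "location": "Uruguay", "active": 1},
]
conn.executemany("INSERT INTO ENTRIES VALUES (:id, :name, :location, :active)", db_entries)

findings = [
    {"name": "Crazy", "location": "Guyana"},   # already known -> stays active
    {"name": "Shy", "location": "Suriname"},   # new -> inserted as active
]

# Key both sides by (name, location) and diff with set operations
db_keys = {(d["name"], d["location"]) for d in db_entries}
new_keys = {(f["name"], f["location"]) for f in findings}

next_id = max((d["id"] for d in db_entries), default=0) + 1
conn.executemany(
    "INSERT INTO ENTRIES (ID, NAME, LOCATION, ACTIVE) VALUES (?, ?, ?, 1)",
    [(next_id + i, name, loc) for i, (name, loc) in enumerate(sorted(new_keys - db_keys))])
conn.executemany(
    "UPDATE ENTRIES SET ACTIVE = 1 WHERE NAME = ? AND LOCATION = ?",
    sorted(new_keys & db_keys))
conn.executemany(
    "UPDATE ENTRIES SET ACTIVE = 0 WHERE NAME = ? AND LOCATION = ?",
    sorted(db_keys - new_keys))
conn.commit()

print(list(conn.execute("SELECT NAME, ACTIVE FROM ENTRIES ORDER BY ID")))
# -> [('Crazy', 1), ('Lucky', 0), ('Shy', 1)]
```

This does one commit instead of one per row, and the set differences make the insert/reactivate/deactivate cases explicit.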

What would you improve in this process?

Upvotes: 0

Views: 61

Answers (1)

hexerei software

Reputation: 3160

Wouldn't a simpler approach be to just insert all the new data, keeping track of duplicates by appending ON CONFLICT DO UPDATE SET?

You wouldn't even necessarily need the ID field, but you would need a unique key on NAME and LOCATION to identify duplicates. The following query then identifies the duplicate and, instead of inserting it, just updates the NAME field with the same value again (so basically the same result as ignoring the row).

INSERT INTO ENTRIES (NAME, LOCATION)
VALUES ('Crazy', 'Guyana')
ON CONFLICT(NAME,LOCATION) DO UPDATE SET NAME = 'Crazy';

then you can simply execute:

conn.execute('''INSERT INTO ENTRIES(NAME, LOCATION) VALUES (?, ?) ON CONFLICT(NAME,LOCATION) DO UPDATE SET NAME=?''',
    (f["name"], f["location"], f["name"]))

This would simplify your "insert only new entries" process. I reckon you could also combine this in such a way that the update you perform doesn't touch the NAME field, but instead adds your ACTIVE logic here.
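A rough sketch of that combination (sample in-memory data here; assumes the unique key on NAME, LOCATION mentioned above): reset everything to inactive first, then let the upsert insert or reactivate whatever was fetched.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE ENTRIES (
    NAME TEXT NOT NULL, LOCATION TEXT NOT NULL, ACTIVE NUMERIC,
    UNIQUE(NAME, LOCATION))""")
conn.execute("INSERT INTO ENTRIES VALUES ('Lucky', 'Uruguay', 1)")  # no longer fetched
conn.execute("INSERT INTO ENTRIES VALUES ('Crazy', 'Guyana', 1)")   # fetched again

findings = [{"name": "Crazy", "location": "Guyana"},
            {"name": "Shy", "location": "Suriname"}]

conn.execute("UPDATE ENTRIES SET ACTIVE = 0")  # assume nothing is active...
conn.executemany(
    """INSERT INTO ENTRIES (NAME, LOCATION, ACTIVE) VALUES (:name, :location, 1)
       ON CONFLICT(NAME, LOCATION) DO UPDATE SET ACTIVE = 1""",
    findings)  # ...then every fetched row gets inserted or reactivated
conn.commit()

print(list(conn.execute("SELECT NAME, ACTIVE FROM ENTRIES ORDER BY NAME")))
# -> [('Crazy', 1), ('Lucky', 0), ('Shy', 1)]
```

The blanket UPDATE touches every row, so it trades the questioner's "only update what changed" goal for much simpler code; it also needs SQLite 3.24.0 or newer, as noted below.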

Also note that SQLite supports UPSERT since version 3.24.0.

Upvotes: 2
