Reputation: 4079
I have a quirky GCP Cloud Function that does, at a high level, unintelligent website scraping by taking advantage of the website accepting a numerically ascending 'id' parameter.
main.py
import requests
import sqlalchemy
from bs4 import BeautifulSoup

def upsert(db, i, x, y):
    with db.connect() as cursor:
        cursor.execute(f'REPLACE INTO TABLE_NAME (i, x, y) VALUES ({i}, "{x}", "{y}")')

def main(data, context):
    db = sqlalchemy.create_engine(...)  # connection to the MySQL instance
    for i in range(0, 100):
        soup = BeautifulSoup(requests.get(f"https://websiteimscraping.com/id={i}").text, "html.parser")
        if soup is not valid:  # pseudocode: edge-case checks that skip invalid ids
            continue
        x = soup.find_all(...)
        y = soup.find_all(...)
        upsert(db, i, x, y)
In this example, I know for a fact that most if not all of the ids 0 through 99 are valid and unique, so I should have a very similar number of rows in the database when the cloud function completes.
To verify that I have 100 rows in MySQL, I enter debug mode and evaluate the expression:
list(cursor.execute("SELECT * FROM TABLE_NAME"))
and I get back roughly 20% of the rows I'd expect, without any semblance of order, with ids like (7, 8, 9, 10, 13, 44, 45, 46, 47, 48, 49, 50, 51, 86).
I added some print statements to make sure none of my edge cases were triggered (which would result in skipping an invalid id), and indeed none were, which I verified by stepping through in debug mode and seeing every id reach the upsert step.
Something I think is important to clarify is that the rows saved to the database are not randomly determined: it is the same set of rows that fails to be inserted/updated on every run.
A core assumption I'm making is that if there were a data issue with the rows that are not being inserted, one that broke the REPLACE statement, sqlalchemy would raise an exception for me. sqlalchemy raised no exceptions throughout the entire run.
Is there anything non-data-related, with regard to sqlalchemy, that could be causing rows not to be inserted?
I have tried looking for patterns in the data of the rows that were inserted, to gauge some sort of data (in)consistency (which would seem like the likelier suspect), but haven't found any.
I wish it were simpler to share the code for reproduction, but I'm pretty sure no one wants to set up their own MySQL instance to try this out. If you are willing to, I am happy to share the actual code so you can try it yourself.
Upvotes: 2
Views: 114
Reputation: 1392
I have executed your code and managed to reproduce your scenario, encountering the same problem: only some rows seem to be inserted.
The problem is that you are using a REPLACE statement. As stated in MySQL's documentation, REPLACE works exactly like INSERT, except that if an existing row has the same value as the new row for a PRIMARY KEY or UNIQUE index, the old row is deleted before the new row is inserted, rather than the query failing.
From your shared code it looks like you are using the field 'code' as the PRIMARY KEY, but you are checking the field 'id' as the indicator of an inserted web page. Multiple pages share the same 'code', so every time one is inserted the previous row with that 'code' is removed. This also explains why sqlalchemy raised no exceptions: replacing an existing row is not an error.
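To illustrate (the table and column names here are just placeholders, not your schema), REPLACE silently removes the conflicting row and raises no error:
CREATE TABLE demo (code INT PRIMARY KEY, id INT);
REPLACE INTO demo (code, id) VALUES (1, 7);
REPLACE INTO demo (code, id) VALUES (1, 8);  -- deletes the row (1, 7), then inserts (1, 8)
SELECT * FROM demo;                          -- returns only (1, 8); no error was ever raised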
I have solved your issue by simply making the field 'id' the PRIMARY KEY instead of 'code'. Once you change that (remember to drop the original table), you can run your code again and you will see no missing ids. You can verify it using:
SELECT id FROM CLASSES_MASTER;
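If it helps, here is a minimal sketch of the recreated table; only 'id' and 'code' are known from this discussion, so the remaining column and the types are placeholders you should adjust to your real schema:
DROP TABLE IF EXISTS CLASSES_MASTER;
CREATE TABLE CLASSES_MASTER (
    id   INT NOT NULL,    -- the scraped page id
    code VARCHAR(255),    -- duplicates are allowed here now
    name VARCHAR(255),    -- placeholder for your other scraped fields
    PRIMARY KEY (id)      -- REPLACE now keys on 'id', so each page keeps its own row
);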
By the way, now that you have all the data inserted, you can check the number of repeated codes with this:
SELECT code, COUNT(*) FROM CLASSES_MASTER GROUP BY code;
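If you only want to see the codes that actually collide, the same query can be filtered with HAVING:
SELECT code, COUNT(*) FROM CLASSES_MASTER GROUP BY code HAVING COUNT(*) > 1;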
Upvotes: 1