Saffik

Reputation: 1003

Efficient way to delete a large amount of records from a big table using python

I have a large table (about 10 million rows) from which I need to delete records that are older than 10 days, according to the created_at column. I have a Python script that I run to do this. created_at is a varchar(255) and holds values like 1594267202000 (a Unix epoch timestamp in milliseconds).

import mysql.connector
import sys
from mysql.connector import Error

table = sys.argv[1]
deleteDays = sys.argv[2]

sql_select_query = """SELECT COUNT(*) FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))""".format(table)
sql_delete_query = """DELETE FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) LIMIT 100""".format(table)

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='myDatabase',
                                         user='admin123',
                                         password='password123')
    cursor = connection.cursor()

    # initial count of rows before deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]


    while records >= 1:
        # stuck at the following line and the lock wait timeout happens....
        cursor.execute(sql_delete_query, (deleteDays,))
        connection.commit()
        cursor.execute(sql_select_query, (deleteDays,))
        records = cursor.fetchone()[0]

    #final count of rows after deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]

    if records == 0:
        print("\nRows deleted")
    else:
        print("\nRows NOT deleted")

except mysql.connector.Error as error:
    print("Failed to delete: {}".format(error))
finally:
    if (connection.is_connected()):
        cursor.close()
        connection.close()
        print("MySQL connection is closed")

When I run this script, however, the DELETE query fails with:

Failed to delete: 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

I know that innodb_lock_wait_timeout is currently set to 50 seconds, and I could increase it to get around this, but I'd rather not touch the timeout. I basically want to delete in chunks instead. Does anyone know how I can do that, using my code as an example?

Upvotes: 0

Views: 1928

Answers (2)

Schwern

Reputation: 164799

"created_at has no index and is a varchar(255)" (Saffik, 11 hours ago)

There's your problem. Two of them.

It needs to be indexed to have any hope of being performant. Without an index, MySQL has to check every record in the table. With an index, it can skip straight to the ones which match.

While storing an integer as a varchar will work (MySQL will convert it for you), it's bad practice: it wastes storage, allows bad data, and is slow.

Change created_at to a bigint so that it's stored as a number, then index it.

alter table your_table modify column created_at bigint;
create index created_at_idx on your_table(created_at);

Now that created_at is an indexed bigint, your query should use the index and it should be very fast.
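
One caveat: for the index to actually be used, the comparison must be against the raw column rather than an expression on it; dividing created_at by 1000 in the WHERE clause forces MySQL to scan every row anyway. A minimal sketch of the adjusted queries from your script, with the cutoff precomputed in Python (your_table stands in for the table name argument):

import time

deleteDays = 10  # sys.argv[2] in the original script

# Precompute the cutoff in milliseconds so the indexed bigint column is
# compared directly; this lets MySQL use the index on created_at.
threshold_ms = int((time.time() - int(deleteDays) * 86400) * 1000)

sql_select_query = "SELECT COUNT(*) FROM your_table WHERE created_at < %s"
sql_delete_query = "DELETE FROM your_table WHERE created_at < %s LIMIT 100"

# Both queries are then executed with (threshold_ms,) as the parameter tuple,
# exactly as in the original script.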


Note that created_at should really be a datetime (datetime(6) if you need microsecond accuracy). Then you can use MySQL's date functions without having to convert.

But that would break your code, which expects a millisecond epoch number, so you're stuck with it for now. Keep it in mind for future tables.

For this table, you can add a generated created_at_datetime column to make working with dates easier. And, of course, index it.

alter table your_table add column created_at_datetime datetime generated always as (from_unixtime(created_at/1000));
create index created_at_datetime on your_table(created_at_datetime);

Then your where clause becomes much simpler.

WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)
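
In the Python script from the question, the two queries would then look something like this (a sketch; your_table stands in for the table name argument, and the generated column and index above are assumed to exist):

sql_select_query = ("SELECT COUNT(*) FROM your_table "
                    "WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)")
sql_delete_query = ("DELETE FROM your_table "
                    "WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY) "
                    "LIMIT 100")

# Executed as before, e.g. cursor.execute(sql_delete_query, (deleteDays,))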

Upvotes: 0

Tim Biegeleisen

Reputation: 521178

One approach here is to use DELETE with a LIMIT clause, to batch your deletes at a certain size. Assuming batches of 100 records:

DELETE
FROM yourTable
WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))
LIMIT 100;

Note that, strictly speaking, you should always have an ORDER BY clause when using LIMIT; what I wrote above might delete any 100 records matching the deletion criteria.
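
Wired into your script, the loop can stop as soon as a batch deletes fewer rows than the limit, using cursor.rowcount instead of re-counting after every batch. A minimal sketch, assuming the mysql.connector setup from the question (connection details and your_table are placeholders):

import mysql.connector

# Placeholder connection details -- substitute your own.
connection = mysql.connector.connect(host='localhost', database='myDatabase',
                                     user='admin123', password='password123')
cursor = connection.cursor()

deleteDays = 10  # sys.argv[2] in the original script

sql_delete_query = ("DELETE FROM your_table "
                    "WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) "
                    "ORDER BY created_at "  # ORDER BY as recommended above
                    "LIMIT 100")

while True:
    cursor.execute(sql_delete_query, (deleteDays,))
    connection.commit()           # commit each small batch so locks are released quickly
    if cursor.rowcount < 100:     # fewer rows than the batch size: nothing left to delete
        break

cursor.close()
connection.close()

Each batch commits on its own, so every transaction stays small and short-lived instead of one huge delete holding locks for a long time.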

Upvotes: 0
