Jasper
Jasper

Reputation: 2231

Speeding up SQL query with more than 6 billion operations

We have a Django project with the following problem:

User objects are stored in the auth_user table

These have a One-to-One relation to the Profile object, this object is stored in the yuza_profile table. Its relation to the user is maintained through the user_id field

However, years ago thousands of Profile objects were deleted without removing the linked User objects.

These users without profiles are causing problems in our system and we would like to remove them.

I created the following SQL query to perform this task

DELETE FROM auth_user WHERE id NOT IN (SELECT user_id FROM yuza_profile); 

Since the Profile objects have been deleted I cannot scan for null values on *_id fields - which would be my first move here.

Instead I am looping over two tables that both contain more than 60.000 rows which leads to a slow and inefficient query with more than 6 billion operations (and server timeouts)

Is there a way to speed up this query? I am aware that my query is very inefficient but I do not know of a way to improve it and any help would be greatly appreciated.

EDIT: As requested I've included the schemas below:

user_profile

id - integer
user_id - integer
gender - varchar(2)
birth_date - date
address - varchar(255)
city - varchar(255)
phone_number - varchar(10)
avatar - varchar(255)
---------------------
user_profile_pkey - (id)
user_profile_user_id_key - (user_id)
user_profile_user_id_fkey - (user_id) -> auth_user(id)
user_profile_pkey - (id) UNIQUE
user_profile_user_id_key - (user_id) UNIQUE
auth_user

id - integer
username - varchar(150)
first_name - varchar(30)
last_name - varchar(30)
email - varchar(75)
password - varchar(128)
is_staff - boolean
is_active - boolean
is_superuser - boolean
last_login - timestamp with time zone
date_joined - timestamp with time zone
---------------
auth_user_pkey - (id)
auth_user_username_key - (username)
auth_user_pkey - (id) UNIQUE
auth_user_username_key - (username) UNIQUE

I should note that I've started to wonder if the IDE (PycharmPro) I am using might be a factor - when testing my queries with the SELECT statement the IDE displays queryresults as pages with 500 results per page. The initial query indeed took less than a second - but pressing the 'go to last page' button took more than 2 minutes (for just 16000 results)

Upvotes: 0

Views: 66

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270301

I strongly advise you to use NOT EXISTS rather than NOT IN. It treats NULL values more intuitively. (NOT IN will filter out all rows if any value in the subquery is NULL.)

So, write the query as:

DELETE au FROM auth_user au
    WHERE NOT EXISTS (SELECT 1
                      FROM yuza_profile vp 
                      WHERE vp.user_id = au.id
                     ); 

This query can take advantage of an index on yuza_profile(user_id):

CREATE INDEX idx_yuza_profile_user_id ON yuza_profile(user_id);

Upvotes: 1

Gustavo Fonseca
Gustavo Fonseca

Reputation: 651

I'm not completely sure about your approach, I'm assuming you took a cursor-based strategy, but can't you just:

DELETE FROM auth_user WHERE id NOT IN (SELECT user_id FROM yuza_profile);

If you have the proper indices, this DML instruction will not require that much operations.

Upvotes: 1

Related Questions