Reputation: 2231
We have a Django project with the following problem:
User
objects are stored in the auth_user
table
These have a One-to-One relation to the Profile
object, this object is stored in the yuza_profile
table. Its relation to the user
is maintained through the user_id
field
However, years ago thousands of Profile
objects were deleted without removing the linked User
objects.
These users without profiles are causing problems in our system and we would like to remove them.
I created the following SQL query to perform this task
DELETE FROM auth_user WHERE id NOT IN (SELECT user_id FROM yuza_profile);
Since the Profile
objects have been deleted I cannot scan for null
values on *_id
fields - which would be my first move here.
Instead I am looping over two tables that both contain more than 60.000 rows which leads to a slow and inefficient query with more than 6 billion operations (and server timeouts)
Is there a way to speed up this query? I am aware that my query is very inefficient but I do not know of a way to improve it and any help would be greatly appreciated.
EDIT: As requested I've included the schemas below:
user_profile
id - integer
user_id - integer
gender - varchar(2)
birth_date - date
address - varchar(255)
city - varchar(255)
phone_number - varchar(10)
avatar - varchar(255)
---------------------
user_profile_pkey - (id)
user_profile_user_id_key - (user_id)
user_profile_user_id_fkey - (user_id) -> auth_user(id)
user_profile_pkey - (id) UNIQUE
user_profile_user_id_key - (user_id) UNIQUE
auth_user
id - integer
username - varchar(150)
first_name - varchar(30)
last_name - varchar(30)
email - varchar(75)
password - varchar(128)
is_staff - boolean
is_active - boolean
is_superuser - boolean
last_login - timestamp with time zone
date_joined - timestamp with time zone
---------------
auth_user_pkey - (id)
auth_user_username_key - (username)
auth_user_pkey - (id) UNIQUE
auth_user_username_key - (username) UNIQUE
I should note that I've started to wonder if the IDE (PycharmPro) I am using might be a factor - when testing my queries with the SELECT
statement the IDE displays queryresults as pages with 500 results per page. The initial query indeed took less than a second - but pressing the 'go to last page' button took more than 2 minutes (for just 16000 results)
Upvotes: 0
Views: 66
Reputation: 1270301
I strongly advise you to use NOT EXISTS
rather than NOT IN
. It treats NULL
values more intuitively. (NOT IN
will filter out all rows if any value in the subquery is NULL
.)
So, write the query as:
DELETE au FROM auth_user au
WHERE NOT EXISTS (SELECT 1
FROM yuza_profile vp
WHERE vp.user_id = au.id
);
This query can take advantage of an index on yuza_profile(user_id)
:
CREATE INDEX idx_yuza_profile_user_id ON yuza_profile(user_id);
Upvotes: 1
Reputation: 651
I'm not completely sure about your approach, I'm assuming you took a cursor-based strategy, but can't you just:
DELETE FROM auth_user WHERE id NOT IN (SELECT user_id FROM yuza_profile);
If you have the proper indices, this DML instruction will not require that much operations.
Upvotes: 1