Reputation: 15
I have two tables in Postgres: emails, which has 10,000,000 rows, and spam_email, which has 150,000 rows. The emails table contains normal email addresses like '[email protected]', '[email protected]', etc., and the spam_email table contains addresses like '@domain.com', '@domain1.com', etc.
What I want to do is find all the values from the emails table that match the spam_email table. This is the code I tried:
Insert INTO clean_email(email)
select distinct email
from email e
join (select email from spam_email) se ON se.email not like s.email
But every time I run it, I get a "disk out of space" error.
Do you have any other solution for this? Thanks
Upvotes: 0
Views: 87
Reputation: 12562
You are trying to match every record in email with every record in spam_email that is not like it, which would produce approximately 10,000,000 × 150,000 records, making you run out of space.
I think the query below is much closer to what you want to accomplish, assuming that spam_email stores records such as '@domain1.com'.
insert into clean_email(email)
select distinct
e.email
from email e
-- extract_domain_from_mail is assumed to return the '@domain' part of an address
inner join spam_email se on extract_domain_from_mail(e.email) = se.email
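Note that extract_domain_from_mail is not a built-in Postgres function; a minimal sketch of such a helper, assuming spam_email stores the leading '@' as shown in the question, could be:

```sql
-- Hypothetical helper (not built in): returns the '@domain' suffix of an address,
-- e.g. 'user@domain1.com' -> '@domain1.com', matching the spam_email format.
CREATE OR REPLACE FUNCTION extract_domain_from_mail(addr text)
RETURNS text AS $$
  SELECT '@' || split_part(addr, '@', 2);
$$ LANGUAGE sql IMMUTABLE;
```

Declaring it IMMUTABLE also allows building an expression index on it later, so the equality join can use an index scan.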
Upvotes: 0
Reputation: 57421
Insert INTO clean_email(email)
select distinct e.email
from email e
left join spam_email se ON se.email=e.email
where se.email is null;
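This compares the full address to the spam entries. Since spam_email stores only domain patterns like '@domain1.com', a variant of the same anti-join that compares on the domain part (using split_part, which is my assumption about how the addresses are structured) might be:

```sql
-- Anti-join on the domain part: keep addresses whose domain has no spam match
insert into clean_email(email)
select distinct e.email
from email e
left join spam_email se
       on se.email = '@' || split_part(e.email, '@', 2)
where se.email is null;
```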
Upvotes: 0
Reputation: 1635
If you're running out of disk space then maybe you need more disk space; or maybe you need to configure postgresql to make better use of available resources : https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server .
Having said that, there is no need to join to a subselect; it may be using resources for a temporary table or sort that could be avoided.
Your query cannot possibly run as written (there is no table called s, so s.email can't be right).
Try this:
SELECT DISTINCT e.email FROM email e JOIN spam_email se ON e.email NOT LIKE '%' || se.email;
It will still be a slow query plan; a predicate like NOT LIKE '%' || se.email cannot use an index, so it's going to do a lot of table scanning. Better would be to separate the domain part into another column and put an index on it (or use a function for it and put an index on the function).
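For example, a functional (expression) index along those lines could look like the sketch below, assuming the domain is extracted with split_part and spam_email stores the leading '@':

```sql
-- Index the domain expression so an equality join can use an index scan
CREATE INDEX email_domain_idx
    ON email ((('@' || split_part(email, '@', 2))));

-- Equality on the indexed expression instead of NOT LIKE pattern matching
SELECT DISTINCT e.email
FROM email e
JOIN spam_email se
  ON '@' || split_part(e.email, '@', 2) = se.email;
```

The join condition must match the indexed expression exactly for the planner to consider the index.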
Upvotes: 1