kostas pats
kostas pats

Reputation: 15

sql join tales with like

I have two tables in Postgres, emails which have 10.000.000 rows and spam_email which have 150.000 rows. emails tables is consisted by normal email address like '[email protected]', '[email protected]',etc and the spam tables is consisted with address like '@domain.com','@domain1.com' etc.

What I want to do is find all the values from emails table that match spam_email table. What I did is this code:

Insert INTO clean_email(email) 
select distinct email 
from email e 
   join (select email from spam_email) se ON se.email not like s.email

But every time I get an error disk out of space.

Do you have any other solution to do this? Thanks

Upvotes: 0

Views: 87

Answers (3)

Erkan Haspulat
Erkan Haspulat

Reputation: 12562

You are trying to match a record in èmail with every record that is not like email, which would produce approximately 10000000 x 150000 records, making you run out of space.

I think the query below would be much more closer to what you would like to accomplish, assuming that spam_email stores records such as '@domain1.com'.

insert into clear_email(email)
select distinct
  e.email
from email e
inner join spam_email se on extract_domain_from_mail(e.email) = se.email

Upvotes: 0

StanislavL
StanislavL

Reputation: 57421

Insert INTO clean_email(email) 
select distinct email 
from email e 
   left join spam_email se ON se.email=e.email
where se.email is null;

Upvotes: 0

drquicksilver
drquicksilver

Reputation: 1635

If you're running out of disk space then maybe you need more disk space; or maybe you need to configure postgresql to make better use of available resources : https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server .

Having said that, there is no need to join to a subselect and possibly that is using resources for a temporary table or sort which could be avoided.

Your query cannot possibly run as written (there is no table called s, so s.email can't be right).

Try this:

SELECT DISTINCT se.email FROM email e 
  JOIN spam_email se ON e.email NOT LIKE '%' || se.email;

It will still be a slow query plan; there is no way to do a query like like '%' || se.email using an index, so it's going to do a lot of table scanning. Better would be to separate the domain part into another column and put an index on it (or use a function for it and put an index on the function).

Upvotes: 1

Related Questions