Kristián Filo
Kristián Filo

Reputation: 865

Removing rows based on duplicate columns in MySQL

I've come across some similar questions, but I still don't quite understand.

I have a MySQL database with a table in which I store data of people and unique codes. The base unit for me is the e-mail address. I want to select random records, but limit them so a specific email column can be selected only once. This is my table structure (I'm leaving some columns, that are not relevant to this question).

+-----+-------------------+---------+----------+----------+
| ID  | email             | name    | lastname | code     |
+-----+-------------------+---------+----------+----------+
|  1  | [email protected]    | Simon   | Hardy    | 123ABC   |
|  2  | [email protected]    | John    | Doe      | EEEEEE   |
|  3  | [email protected]    | John    | Doe      | AEAEAE   |
|  4  | [email protected]      | Bill    | Liebe    | 5D78AC   |
|  5  | [email protected]   | Ellen   | Petete   | 99AQE5   |
|  6  | [email protected]    | John    | Doe      | 000CVV   |
|  7  | [email protected]   | Peter   | Lorem    | 54ACSS   |
|  8  | [email protected]    | Emma    | Stone    | 98WW7Q   |
+-----+-------------------+---------+----------+----------+

If I limit my selection to 3 rows and somehow the row with email = [email protected] got selected, I need the other two rows with this email to be ignored/skipped. This is my query now:

SELECT * FROM people ORDER BY RAND() LIMIT 3

PS: I know "ORDER BY RAND()" is slow, I just didn't focuse on that part yet.

I was thinking about GROUP BY, but as far as I understand that way I would only get that one column, I need to fetch all of them.

Is there a straightforward solution for this in MySQL?

Upvotes: 1

Views: 71

Answers (2)

Nick Jang
Nick Jang

Reputation: 1

EDIT**

SELECT * FROM Test WHERE id IN (SELECT MIN(id) FROM Test GROUP BY email) LIMIT 3;

this should do.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269463

In MySQL 8+, you can use:

SELECT p.*
FROM people p
ORDER BY ROW_NUMBER() OVER (PARTITION BY email ORDER BY RAND())
LIMIT 3;

If you want to solve both the performance problem and the duplication problem at the same time . . . that is challenging. My recommendation is to select a smaller number of rows and just "hope" that there are enough different emails.

For instance, for 3 emails you might want to get about 100 rows with something like this:

select p.*,
       (@rn := if(@e = email, @rn + 1,
                  if(@e := email, 1, 1)
                 )
       ) as rn
from (select p.*
      from people p cross join
           (select count(*) as cnt from people) pp  -- can use primary key index
      where rand() < (100 / cnt) -- get about 100 rows
      order by email, rand()  -- only on about 100 rows
     ) p cross join
     (select @e := '', @rn := 0) params
having rn = 1
limit 3;

Upvotes: 2

Related Questions