Reputation: 572
I have a MySQL very large database (1 billion rows) like this:
database : products("name","caracteristics")
Both columns are VARCHAR(50).
actually, it have no KEY sat, but "name" will be unique, so I think I will alter it as "name" PRIMARY_KEY. (I should have done that before.. now I need to perform a remove duplicate query before adding primary_key option I guess)
My problem is, when performing a simple query on the table, it takes ages literally.
SELECT caracteristics WHERE name=blabla LIMIT 1;
//takes ages.
I was thinking of partitioning the existing table.
So here are the question:
Is my idea of ALTER TABLE to set 'name' column as PRIMARY_KEY a good idea also?
also about the duplicate query, I found this around here, am I doing it properly? (don't want to mess up my table...)
delete a from products a left join( select max(name) maxname, caracteristics from products group by caracteristics) b on a.name = maxname and a.caracteristics= b.caracteristics where b.maxname IS NULL;
Upvotes: 3
Views: 4567
Reputation: 76414
Yes, it is a good idea to fix performance issues. That is the correct answer always when you have performance issues serious-enough to be wondering about performance fixes.
You can achieve that by altering the table
and making name
a primary key
, as you have already realized.
Your query should not be necessary. You should create a temporary table
instead where you would insert
the values you deem necessary. Let's suppose the name of that table is mytemptable
. Then:
insert into mytemptable(name, characteristics)
select name, characteristics
from products
where not exists (select 1
from mytemptable t
where products.name = t.name);
Then remove your records from products
using
delete from products;
then alter products
, make sure it has name
as a primary key
and then
insert into products(name, characteristics)
select name, characteristics
from mytemptable;
and finally drop
your temporary table.
As about your query:
Since you remove records, max(name)
will be equal to all other name
s in your group if you have a single possible name
associated to a given characteristics
value, which is pretty safe to assume.. So, if you have a possible characteristics
value matching a single name
, you will remove all instances of that name
, so yes, your query will mess your data.
Upvotes: 0
Reputation: 15057
you can also direct set a PRIMARY KEY with the ignore option like this:
ALTER IGNORE TABLE `products` ADD PRIMARY KEY(name);
this will delete all duplicates from name.
sample
MariaDB [l]> CREATE TABLE `products` (
-> `name` varchar(50) NOT NULL DEFAULT '',
-> `caracteristics` varchar(50) DEFAULT NULL
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)
MariaDB [l]> INSERT INTO `products` (`name`, `caracteristics`)
-> VALUES
-> ('val1', 'asdfasdfasdf'),
-> ('val2', 'asdasDasd'),
-> ('val3', 'aesfawfa'),
-> ('val1', '99999999');
Query OK, 4 rows affected (0.01 sec)
Records: 4 Duplicates: 0 Warnings: 0
MariaDB [l]> select * from products;
+------+----------------+
| name | caracteristics |
+------+----------------+
| val1 | asdfasdfasdf |
| val2 | asdasDasd |
| val3 | aesfawfa |
| val1 | 99999999 |
+------+----------------+
4 rows in set (0.00 sec)
MariaDB [l]> ALTER IGNORE TABLE `products` ADD PRIMARY KEY(name);
Query OK, 4 rows affected (0.03 sec)
Records: 4 Duplicates: 1 Warnings: 0
MariaDB [l]> select * from products;
+------+----------------+
| name | caracteristics |
+------+----------------+
| val1 | asdfasdfasdf |
| val2 | asdasDasd |
| val3 | aesfawfa |
+------+----------------+
3 rows in set (0.00 sec)
MariaDB [l]>
test ADD PRIMARY KEY / INSERT IGNORE
Here is a test between add Primary key and insert ignore into. and you can see that add Primary key (90 sec / 120 sec) is a little bit faster in this sample
MariaDB [l]> CREATE TABLE `bigtable10m` (
-> `id` varchar(32) NOT NULL DEFAULT ''
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)
MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable10m`
-> select lpad(seq,8,'0') from seq_1_to_10000000;
Query OK, 10000000 rows affected (24.24 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]>
MariaDB [l]> SELECT * FROM `bigtable10m` LIMIT 10;
+----------+
| id |
+----------+
| 00000001 |
| 00000002 |
| 00000003 |
| 00000004 |
| 00000005 |
| 00000006 |
| 00000007 |
| 00000008 |
| 00000009 |
| 00000010 |
+----------+
10 rows in set (0.00 sec)
MariaDB [l]>
MariaDB [l]> CREATE TABLE `bigtable30m` (
-> `id` varchar(32) NOT NULL DEFAULT ''
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)
MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (28.49 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (29.01 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (32.98 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]>
MariaDB [l]> ALTER IGNORE TABLE `bigtable30m` ADD PRIMARY KEY(id);
Query OK, 30000000 rows affected (1 min 32.34 sec)
Records: 30000000 Duplicates: 20000000 Warnings: 0
MariaDB [l]>
MariaDB [l]> DROP TABLE `bigtable30m`;
Query OK, 0 rows affected (0.52 sec)
MariaDB [l]>
MariaDB [l]> CREATE TABLE `bigtable30m` (
-> `id` varchar(32) NOT NULL DEFAULT ''
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.03 sec)
MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (37.29 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (41.87 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (30.87 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [l]>
MariaDB [l]> CREATE TABLE bigtable_unique (
-> `id` varchar(32) NOT NULL DEFAULT '',
-> PRIMARY KEY (id)
-> );
Query OK, 0 rows affected (0.02 sec)
MariaDB [l]>
MariaDB [l]> INSERT IGNORE bigtable_unique SELECT * FROM `bigtable30m`;
Query OK, 10000000 rows affected, 65535 warnings (1 min 57.99 sec)
Records: 30000000 Duplicates: 20000000 Warnings: 20000000
MariaDB [l]>
Upvotes: 3
Reputation: 21522
I think partitionning is not the way you should go for this particular problem. How would you partition? On what criteria?
I think your main concern is architectural and should be fixed prior to anything else: unique records are not unique.
Because of the volumetry I think any solution will take a while to execute. But my bet is that this one is the fastest:
CREATE TABLE products_unique (
name VARCHAR(50) NOT NULL,
characteristics VARCHAR(50),
PRIMARY KEY (name)
);
INSERT IGNORE INTO products_unique SELECT * FROM products;
RENAME TABLE products TO products_backup;
RENAME TABLE products_unique TO products;
Duplicates will be evinced arbitrarily, but I think it is what you are looking for anyway. If it takes too long, you should try running it overnight... I just hope the transaction buffer does not explode on you in which case we'd have to work on some stored procedure to separate the inserts in batches.
Upvotes: 3