user3916429
user3916429

Reputation: 572

MySQL Partitioning (innoDB) - Large table

I have a MySQL very large database (1 billion rows) like this:

database : products("name","caracteristics")

Both columns are VARCHAR(50).

actually, it have no KEY sat, but "name" will be unique, so I think I will alter it as "name" PRIMARY_KEY. (I should have done that before.. now I need to perform a remove duplicate query before adding primary_key option I guess)

My problem is, when performing a simple query on the table, it takes ages literally.

SELECT caracteristics WHERE name=blabla LIMIT 1; //takes ages.

I was thinking of partitioning the existing table.

So here are the question:

delete a
from products a
left join(
select max(name) maxname, caracteristics
from products
group by caracteristics) b
on a.name = maxname and
a.caracteristics= b.caracteristics
where b.maxname IS NULL;

Upvotes: 3

Views: 4567

Answers (3)

Lajos Arpad
Lajos Arpad

Reputation: 76414

Yes, it is a good idea to fix performance issues. That is the correct answer always when you have performance issues serious-enough to be wondering about performance fixes.

You can achieve that by altering the table and making name a primary key, as you have already realized.

Your query should not be necessary. You should create a temporary table instead where you would insert the values you deem necessary. Let's suppose the name of that table is mytemptable. Then:

insert into mytemptable(name, characteristics)
select name, characteristics
from products
where not exists (select 1
                  from mytemptable t
                  where products.name = t.name);

Then remove your records from products using

delete from products;

then alter products, make sure it has name as a primary key and then

insert into products(name, characteristics)
select name, characteristics
from mytemptable;

and finally drop your temporary table.

As about your query:

Since you remove records, max(name) will be equal to all other names in your group if you have a single possible name associated to a given characteristics value, which is pretty safe to assume.. So, if you have a possible characteristics value matching a single name, you will remove all instances of that name, so yes, your query will mess your data.

Upvotes: 0

Bernd Buffen
Bernd Buffen

Reputation: 15057

you can also direct set a PRIMARY KEY with the ignore option like this:

ALTER IGNORE TABLE `products` ADD PRIMARY KEY(name);

this will delete all duplicates from name.

sample

MariaDB [l]> CREATE TABLE `products` (
    ->   `name` varchar(50) NOT NULL DEFAULT '',
    ->   `caracteristics` varchar(50) DEFAULT NULL
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)

MariaDB [l]> INSERT INTO `products` (`name`, `caracteristics`)
    -> VALUES
    ->     ('val1', 'asdfasdfasdf'),
    ->     ('val2', 'asdasDasd'),
    ->     ('val3', 'aesfawfa'),
    ->     ('val1', '99999999');
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

MariaDB [l]> select * from products;
+------+----------------+
| name | caracteristics |
+------+----------------+
| val1 | asdfasdfasdf   |
| val2 | asdasDasd      |
| val3 | aesfawfa       |
| val1 | 99999999       |
+------+----------------+
4 rows in set (0.00 sec)

MariaDB [l]> ALTER IGNORE TABLE `products` ADD PRIMARY KEY(name);
Query OK, 4 rows affected (0.03 sec)
Records: 4  Duplicates: 1  Warnings: 0

MariaDB [l]> select * from products;
+------+----------------+
| name | caracteristics |
+------+----------------+
| val1 | asdfasdfasdf   |
| val2 | asdasDasd      |
| val3 | aesfawfa       |
+------+----------------+
3 rows in set (0.00 sec)

MariaDB [l]>

test ADD PRIMARY KEY / INSERT IGNORE

Here is a test between add Primary key and insert ignore into. and you can see that add Primary key (90 sec / 120 sec) is a little bit faster in this sample

MariaDB [l]> CREATE TABLE `bigtable10m` (
    ->   `id` varchar(32) NOT NULL DEFAULT ''
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)

MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable10m`
    -> select lpad(seq,8,'0') from seq_1_to_10000000;
Query OK, 10000000 rows affected (24.24 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]>
MariaDB [l]> SELECT * FROM `bigtable10m` LIMIT 10;
+----------+
| id       |
+----------+
| 00000001 |
| 00000002 |
| 00000003 |
| 00000004 |
| 00000005 |
| 00000006 |
| 00000007 |
| 00000008 |
| 00000009 |
| 00000010 |
+----------+
10 rows in set (0.00 sec)

MariaDB [l]>
MariaDB [l]> CREATE TABLE `bigtable30m` (
    ->   `id` varchar(32) NOT NULL DEFAULT ''
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.02 sec)

MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (28.49 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (29.01 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (32.98 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]>
MariaDB [l]> ALTER IGNORE TABLE `bigtable30m` ADD PRIMARY KEY(id);
Query OK, 30000000 rows affected (1 min 32.34 sec)
Records: 30000000  Duplicates: 20000000  Warnings: 0

MariaDB [l]>
MariaDB [l]> DROP TABLE `bigtable30m`;
Query OK, 0 rows affected (0.52 sec)

MariaDB [l]>
MariaDB [l]> CREATE TABLE `bigtable30m` (
    ->   `id` varchar(32) NOT NULL DEFAULT ''
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.03 sec)

MariaDB [l]>
MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (37.29 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (41.87 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]> INSERT INTO `bigtable30m` SELECT * FROM `bigtable10m`;
Query OK, 10000000 rows affected (30.87 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [l]>
MariaDB [l]> CREATE TABLE bigtable_unique (
    ->   `id` varchar(32) NOT NULL DEFAULT '',
    ->  PRIMARY KEY (id)
    -> );
Query OK, 0 rows affected (0.02 sec)

MariaDB [l]>
MariaDB [l]> INSERT IGNORE bigtable_unique SELECT * FROM `bigtable30m`;
Query OK, 10000000 rows affected, 65535 warnings (1 min 57.99 sec)
Records: 30000000  Duplicates: 20000000  Warnings: 20000000

MariaDB [l]>

Upvotes: 3

Sebas
Sebas

Reputation: 21522

I think partitionning is not the way you should go for this particular problem. How would you partition? On what criteria?

I think your main concern is architectural and should be fixed prior to anything else: unique records are not unique.

Because of the volumetry I think any solution will take a while to execute. But my bet is that this one is the fastest:

CREATE TABLE products_unique (
 name VARCHAR(50) NOT NULL,
 characteristics VARCHAR(50),
 PRIMARY KEY (name)
);

INSERT IGNORE INTO products_unique SELECT * FROM products;

RENAME TABLE products TO products_backup;
RENAME TABLE products_unique TO products;

Duplicates will be evinced arbitrarily, but I think it is what you are looking for anyway. If it takes too long, you should try running it overnight... I just hope the transaction buffer does not explode on you in which case we'd have to work on some stored procedure to separate the inserts in batches.

Upvotes: 3

Related Questions