Reputation: 1306

Optimize mysql query involving millions of rows

I have, in a project, a database with two big tables, "terminosnoticia" have 400 Million rows and "noticia" 3 Million. I have one query I want to make lighter (it spend from 10s to 400s):

    SELECT noticia_id, termino_id
      FROM noticia 
      LEFT JOIN terminosnoticia on terminosnoticia.noticia_id=noticia.id AND termino_id IN (7818,12345) 
     WHERE noticia.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00' 
       AND noticia_id is not null AND termino_id is not null;`

The only viable solution I have to explore is to denormalize the database to include the 'fecha' field in the big table, but, this will multiply the index sizes.

Explain plan:

+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| id | select_type | table           | type   | possible_keys         | key        | key_len | ref                                     | rows  | Extra       |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
|  1 | SIMPLE      | terminosnoticia | ref    | noticia_id,termino_id | termino_id | 4       | const                                   | 58480 | Using where |
|  1 | SIMPLE      | noticia         | eq_ref | PRIMARY,fecha         | PRIMARY    | 4       | db_resumenes.terminosnoticia.noticia_id |     1 | Using where |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+

Changing the query and creating the index as suggested, the explain plan is now:

+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type   | possible_keys                             | key                 | key_len | ref                       | rows  | Extra       |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
|  1 | SIMPLE      | T     | ref    | noticia_id,termino_id,terminosnoticia_cpx | terminosnoticia_cpx | 4       | const                     | 60600 | Using index |
|  1 | SIMPLE      | N     | eq_ref | PRIMARY,fecha                             | PRIMARY             | 4       | db_resumenes.T.noticia_id |     1 | Using where |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+

But the execution time does not vary too much...

Any idea?

Upvotes: 0

Answers (4)

DRapp

Reputation: 48179

As Strawberry pointed out, by having an "AND" in your where clause for NOT NULL is the same as a regular INNER JOIN and can be reduced to.

 SELECT 
       N.id as noticia_id, 
       T.termino_id
   FROM 
      noticia N  USING INDEX (fecha)
         JOIN terminosnoticia T
            on N.id = T.noticia_id
            AND T.termino_id IN (7818,12345) 
   WHERE 
      N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'

Now, that said and aliases applied, I would suggest the following covering indexes as

table           index
Noticia         ( fecha, id )
terminosnoticia ( noticia_id, termino_id )

This way the query can get all the results directly from the indexes and not have to go to the raw data pages to qualify the other fields.

Upvotes: 4

spencer7593

Reputation: 108530

We're assuming that the noticia_id and termino_id are columns in terminosnoticia table. (We wouldn't have to guess, if all of the column references were qualified with the table name or a short table alias.)

Why is this an outer join? The predicates in the WHERE clause are going to exclude rows with NULL values for columns from terminosnoticia. That's going to negate the "outerness" of the join.

And if we write this as an inner join, those predicates in the WHERE clause are redundant. We already know that noticia_id won't be NULL (if it satisfies the equality predicate in the ON clause). Same for termino_id, that won't be NULL if it's equal to a value in the IN list.

I believe this query will return an equivalent result:

  SELECT t.noticia_id
       , t.termino_id
    FROM noticia n
    JOIN terminosnoticia t
      ON t.noticia_id = n.id
     AND t.termino_id IN (7818,12345)
   WHERE n.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'

What's left now is figuring out if there's any implicit datatype conversions.

We don't see the datatype of termino_id. So we don't know if that's defined as numeric. It's bad news if it's not, since MySQL will have to perform a conversion to numeric, for every row in the table, so it can do the comparison to the numeric literals.

We don't see the datatypes of the noticia_id, and whether that matches the datatype of the column it's being compared to, the id column from noticia table.

We also don't see the datatype of fecha. Based on the string literals in the between predicate, it looks like it's probably a DATETIME or TIMESTAMP. But that's just a guess. We don't know, since we don't have a table definition available to us.

Once we have verified that there aren't any implicit datatype conversions that are going to bite us...

For the query with the inner join (as above), the best shot at reasonable performance will likely be with MySQL making effective use of covering indexes. (A covering index allows MySQL to satisfy the query directly from from the index blocks, without needing to lookup pages in the underlying table.)

As DRApp's answer already states, the best candidates for covering indexes, for this particular query, would be:

 ... ON noticia (fecha, id)
 ... ON terminosnoticia (noticia_id, termino_id)

An index that has those same leading columns in that same order would also be suitable, and would render these indexes redundant.

The addition of these indexes will render other indexes redundant.

The first index would be redundant with ... ON noticia (fecha). Assuming the index isn't enforcing a UNIQUE constraint, it could be dropped. Any query making effective use of that index could use the new index, since fecha is the leading column in the new index.

Similarly, an index ... ON terminosnoticia (noticia_id) would be redundant. Again, assuming it's not a unique index, enforcing a UNIQUE constraint, that index could be dropped as well.

Upvotes: 0

olegsv

Reputation: 1462

Try this:

SELECT tbl1.noticia_id, tbl1.termino_id FROM
( SELECT FROM terminosnoticia WHERE
terminosnoticia.termino_id IN (7818,12345) 
AND terminosnoticia.noticia_id is not null 
) tbl1 INNER JOIN 
( SELECT id FROM noticia 
  WHERE noticia.fecha
  BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
) tbl2 ON tbl1.id=tbl2.noticia.id

Upvotes: 0

Pablo Santa Cruz

Reputation: 181460

Assuming noticia_id is noticia's primary key, I would add the following indexes:

create index noticia_fecha_idx on noticia(fecha);
create index terminosnoticia_id_noticia_idx on terminosnoticia(noticia_id);

And try your queries again.

Do include the current execution plan of your query. It might help on helping you figuring this one out.

Upvotes: 1

Optimize mysql query involving millions of rows

Answers (4)

Related Questions