UgoL
UgoL

Reputation: 919

Optimize and speed up MySQL query selection

I'm trying to figure out which is the best way to optimize my current selection query on a MySQL database.

I have 2 MySQL tables with a relationship one-to-many. One is the user table that contains the unique list of users and It has around 22krows. One is the linedata table which contains all the possible coordinates for each user and it has around 490k rows.

In this case we can assume the foreign key between the 2 tables is the id value. In the case of the user table the id is also the auto-increment primary key, while in the linedata table it's not primary key cause we can have more rows for the same user.

The CREATE STMT structure

CREATE TABLE `user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `isActive` tinyint(4) NOT NULL,
  `userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
  `age` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


CREATE TABLE `linedata` (
  `id` int(11) NOT NULL,
  `userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `timestamp` datetime NOT NULL,
  `x` float NOT NULL,
  `y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

The selection query

SELECT 
        u.id, 
        u.isActive, 
        u.userId,
        u.name,
        u.gender,
        u.age,
        GROUP_CONCAT(CONCAT_WS(', ',timestamp,x, y)
                     ORDER BY timestamp ASC SEPARATOR '; '
                    ) as linedata_0

        FROM user u 
        JOIN linedata l
        ON u.id=l.id
        WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
          AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
        GROUP BY userId;

The EXPLAIN output

+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
    |   ID  |   SELECT_TYPE |   TABLE   |   TYPE    |   POSSIBLE_KEYS   |   KEY     |   KEY_LEN     |   REF     |   ROWS    |       EXTRA                                                |
    +-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
    |   1   |   SIMPLE      |   l      |   ALL   |   NULL         |   NULL |      NULL        |   NULL    |   491157   |   "Using where; Using temporary; Using filesort" |
    +-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
    |   1   |   SIMPLE      |   u      |   eq_ref  |   PRIMARY         |   PRIMARY |      4        |   l.id   |   1       |     NULL                                                   |
    +-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+

The selection query works if for example I add another WHERE condition for filter single users. Let's say that I want to select just 200 user, then I got around 14 seconds as execution time. Around 7 seconds if I select just the first 100 users. But in case of having only datetime range condition it seems loading without an ending point. Any suggestions?

UPDATE

After following the Rick's suggestions now the query benchmark is around 14 seconds. Here below the EXPLAIN EXTENDED:

id,select_type,table,type,possible_keys,key,key_len,ref,rows,filtered,Extra 1,PRIMARY,u,index,PRIMARY,PRIMARY,4,NULL,21959,100.00,NULL 1,PRIMARY,l,ref,id_timestamp_index,id_timestamp_index,4,u.id,14,100.00,"Using index condition" 2,"DEPENDENT SUBQUERY",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"No tables used"

I have changed a bit some values of the tables:

linedata table user table

Where the id in user table can be joined with userId in linedata table. And they are integer now. We will have string type just for the userId value in user table cause it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0.

So, just for make a quick example we will have in user and in linedata table:

+-------+-----------+-----------+-------------------+--------+---+
|   id  | isActive  |   userId  |       name        | gender |age|
+-------+-----------+-----------+-------------------+--------+---+
|   1   |   1       |  x4by4d   |   john            | m      | 22|
|   2   |   1       |  3ub3ub   |   bob             | m      | 50|
+-------+-----------+-----------+-------------------+--------+---+



+-------+-----------+-----------+------+---+
|   id  | userId    |timestamp  |  x   | y |
+-------+-----------+-----------+------+----+
|   1   |   1       | somedate  |  30  | 10 |
|   2   |   1       | somedate  |  45  | 15 |
|   3   |   1       | somedate  |  50  | 20 |
|   4   |   2       | somedate  |  20  |  5 |
|   5   |   2       | somedate  |  25  | 10 |
+-------+-----------+-----------+------+----+

I have added a compound index made of userId and timestamp values in linedata table.

Maybe instead of having as primary key an ai id value for linedata table, if I add a composite primary key made of userId+timestamp? Should increase the performance or maybe not?

Upvotes: 1

Views: 97

Answers (1)

Rick James
Rick James

Reputation: 142560

I need to help you fix several bugs before discussing performance.

First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000' and something needs to be done about the timezone.

Then, don't "hide a column in a function". That is DATEDIFF(l.timestamp ...) cannot use any indexing on timestamp.

So, instead of

    WHERE  DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
      AND  DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0

do something like

    WHERE  l.timestamp >= '2018-02-28 20:00:00.000'
      AND  l.timestamp  < '2018-11-20 09:20:08.218'

I'm confused about the two tables. Both have id and userid, yet you join on id. Perhaps instead of

CREATE TABLE `linedata` (
  `id` int(11) NOT NULL,
  `userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  ...

you meant

CREATE TABLE `linedata` (
  `id` int(11) NOT NULL  AUTO_INCREMENT,  -- (the id for `linedata`)
  `userId` int NOT NULL,   -- to link to the other table
  ...
  PRIMARY KEY(id)
...

Then there could be several linedata rows for each user.

At that point, this

    JOIN  linedata l  ON u.id=l.id

becomes

    JOIN  linedata l  ON u.id=l.userid

Now, for performance: linedata needs INDEX(userid, timestamp) - in that order.

Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?

And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for about 50 points. If a 'user' can be in more than 50 spots in 9 days, consider increasing group_concat_max_len before running the query.

To make it work even faster, reformulate it this way:

SELECT  u.id, u.isActive, u.userId, u.name, u.gender, u.age,
        ( SELECT  GROUP_CONCAT(CONCAT_WS(', ',timestamp, x, y)
                      ORDER BY timestamp ASC
                      SEPARATOR '; ')
        ) as linedata_0
    FROM  user u
    JOIN  linedata l  ON u.id = l.userid
    WHERE  l.timestamp >= '2018-02-28 20:00:00.000'
      AND  l.timestamp  < '2018-11-20 09:20:08.218';

Another thing. You probably want to be able to look up a user by name; so add INDEX(name)

Oh, what the heck is the VARCHAR(255) for userID?? Ids are normally integers.

Upvotes: 2

Related Questions