Reputation: 919
I'm trying to figure out the best way to optimize my current selection query on a MySQL database.
I have two MySQL tables with a one-to-many relationship. One is the user table, which contains the unique list of users and has around 22k rows. The other is the linedata table, which contains all the possible coordinates for each user and has around 490k rows.
In this case we can assume the foreign key between the two tables is the id value. In the user table, id is also the auto-increment primary key, while in the linedata table it is not a primary key, because we can have more rows for the same user.
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isActive` tinyint(4) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
`age` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`timestamp` datetime NOT NULL,
`x` float NOT NULL,
`y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
SELECT
u.id,
u.isActive,
u.userId,
u.name,
u.gender,
u.age,
GROUP_CONCAT(CONCAT_WS(', ', timestamp, x, y)
ORDER BY timestamp ASC SEPARATOR '; '
) as linedata_0
FROM user u
JOIN linedata l
ON u.id=l.id
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
GROUP BY userId;
+----+-------------+-------+--------+---------------+---------+---------+------+--------+----------------------------------------------+
| ID | SELECT_TYPE | TABLE | TYPE   | POSSIBLE_KEYS | KEY     | KEY_LEN | REF  | ROWS   | EXTRA                                        |
+----+-------------+-------+--------+---------------+---------+---------+------+--------+----------------------------------------------+
|  1 | SIMPLE      | l     | ALL    | NULL          | NULL    | NULL    | NULL | 491157 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY       | PRIMARY | 4       | l.id |      1 | NULL                                         |
+----+-------------+-------+--------+---------------+---------+---------+------+--------+----------------------------------------------+
The selection query works if, for example, I add another WHERE condition to filter for single users. Let's say I select just 200 users: then I get around 14 seconds of execution time, and around 7 seconds if I select just the first 100 users. But with only the datetime range condition, it seems to load without ever finishing. Any suggestions?
After following Rick's suggestions, the query benchmark is now around 14 seconds. Here below is the EXPLAIN EXTENDED output:
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
| id | select_type        | table | type  | possible_keys      | key                | key_len | ref  | rows  | filtered | Extra                 |
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
|  1 | PRIMARY            | u     | index | PRIMARY            | PRIMARY            | 4       | NULL | 21959 | 100.00   | NULL                  |
|  1 | PRIMARY            | l     | ref   | id_timestamp_index | id_timestamp_index | 4       | u.id |    14 | 100.00   | Using index condition |
|  2 | DEPENDENT SUBQUERY | NULL  | NULL  | NULL               | NULL               | NULL    | NULL |  NULL | NULL     | No tables used        |
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
I have changed some values in the tables: the id in the user table can now be joined with userId in the linedata table, and both are integers. The only string column left is the userId value in the user table, because it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0. So, just to make a quick example, we will have in the user table and in the linedata table:
+----+----------+--------+------+--------+-----+
| id | isActive | userId | name | gender | age |
+----+----------+--------+------+--------+-----+
|  1 | 1        | x4by4d | john | m      |  22 |
|  2 | 1        | 3ub3ub | bob  | m      |  50 |
+----+----------+--------+------+--------+-----+
+----+--------+-----------+----+----+
| id | userId | timestamp | x  | y  |
+----+--------+-----------+----+----+
|  1 | 1      | somedate  | 30 | 10 |
|  2 | 1      | somedate  | 45 | 15 |
|  3 | 1      | somedate  | 50 | 20 |
|  4 | 2      | somedate  | 20 |  5 |
|  5 | 2      | somedate  | 25 | 10 |
+----+--------+-----------+----+----+
I have added a compound index made of the userId and timestamp values on the linedata table. What if, instead of having an auto-increment id value as the primary key of the linedata table, I add a composite primary key made of userId + timestamp? Should that increase performance, or maybe not?
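In other words, the two alternatives would look something like this (the index name matches the one in the EXPLAIN above; the composite primary key assumes no user ever has two rows with the same timestamp):
-- Option 1: keep the auto-increment id as PRIMARY KEY and add a
-- secondary compound index on (userId, timestamp):
ALTER TABLE linedata
  ADD INDEX id_timestamp_index (userId, `timestamp`);

-- Option 2: replace the surrogate primary key with a composite one.
-- Assumes (userId, timestamp) is unique, i.e. no user has two
-- coordinates recorded at the same instant:
ALTER TABLE linedata
  MODIFY id INT NOT NULL,   -- drop AUTO_INCREMENT so the PK can be dropped
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (userId, `timestamp`);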
Upvotes: 1
Views: 97
Reputation: 142560
I need to help you fix several bugs before discussing performance.
First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000', and something needs to be done about the timezone.
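If the trailing Z means UTC and your server runs in a different time zone, CONVERT_TZ with numeric offsets is one way to translate the literal ('+01:00' below is just a placeholder for your server's offset):
-- Translate the UTC instant into the server's local time;
-- numeric offsets work even without the time zone tables loaded:
SELECT CONVERT_TZ('2018-02-28 20:00:00.000', '+00:00', '+01:00');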
Then, don't "hide a column in a function". That is, DATEDIFF(l.timestamp, ...) cannot use any index on timestamp.
So, instead of
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
do something like
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
I'm confused about the two tables. Both have id and userId, yet you join on id. Perhaps instead of
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
...
you meant
CREATE TABLE `linedata` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- (the id for `linedata`)
`userId` int NOT NULL, -- to link to the other table
...
PRIMARY KEY(id)
...
Then there could be several linedata rows for each user.
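Spelled out in full, the corrected table might look like this (the non-key columns are simply carried over from your original DDL):
CREATE TABLE `linedata` (
  `id` int(11) NOT NULL AUTO_INCREMENT,  -- the id for `linedata`
  `userId` int NOT NULL,                 -- links to `user`.`id`
  `timestamp` datetime NOT NULL,
  `x` float NOT NULL,
  `y` float NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;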
At that point, this
JOIN linedata l ON u.id=l.id
becomes
JOIN linedata l ON u.id=l.userId
Now, for performance: linedata needs INDEX(userId, timestamp) - in that order.
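That is (the index name is arbitrary):
-- userId first to satisfy the join, timestamp second for the range scan:
ALTER TABLE linedata ADD INDEX idx_user_ts (userId, `timestamp`);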
Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?
And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for only about 50 points. If a 'user' can be in more than 50 spots across that nearly nine-month range, consider increasing group_concat_max_len before running the query.
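For example (1,000,000 bytes is an arbitrary value; size it to your worst-case user):
-- Raise the GROUP_CONCAT limit for the current session only:
SET SESSION group_concat_max_len = 1000000;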
To make it work even faster, reformulate it by moving the join and the date filter into a correlated subquery, which eliminates the GROUP BY entirely:
SELECT u.id, u.isActive, u.userId, u.name, u.gender, u.age,
    ( SELECT GROUP_CONCAT(CONCAT_WS(', ', l.timestamp, l.x, l.y)
                 ORDER BY l.timestamp ASC
                 SEPARATOR '; ')
        FROM linedata l
       WHERE l.userId = u.id
         AND l.timestamp >= '2018-02-28 20:00:00.000'
         AND l.timestamp <  '2018-11-20 09:20:08.218'
    ) AS linedata_0
  FROM user u;
Note that this version returns every user; linedata_0 is NULL for users with no points in the range.
Another thing: you probably want to be able to look up a user by name, so add INDEX(name).
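That is (the index name is arbitrary):
-- Secondary index to support lookups by name:
ALTER TABLE user ADD INDEX idx_name (name);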
Oh, what the heck is the VARCHAR(255) for userId?? Ids are normally integers.
Upvotes: 2