Alex

Reputation: 1

MemSQL search performance: JSON vs varchar

I'm running some MemSQL performance tests on sample data and am seeing very poor behavior while querying JSON data. I have two tables that look very similar and contain exactly the same information (loaded from the same CSV file). The only difference is that the segments column is JSON vs varchar(255).

CREATE TABLE `test_events` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` JSON COLLATE utf8_bin NOT NULL,
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
)

CREATE TABLE `test_events_string` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
)

And here is an example row (the number of items in the array varies from 1 to 20):

memsql> select * from test_events limit 1;
+---------------------+---------+------------------------+
| timestamp           | user_id | segments               |
+---------------------+---------+------------------------+
| 2017-01-04 00:00:00 |   26834 | [19,18,9,6,7,22,34,43] |
+---------------------+---------+------------------------+

Below are two queries that fetch the same information, but their speed is very different. Each query was executed twice, and I'm copying the second run:

memsql> select count(*) from test_events where json_array_contains_double(segments, 42);
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (15.86 sec)

memsql> select count(*) from test_events_string where segments like '%42%';
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (1.96 sec)

memsql> select count(*) from test_events;
+-----------+
| count(*)  |
+-----------+
| 306939340 |
+-----------+
1 row in set (0.02 sec)

So the JSON scan is 8 times slower than a %x% LIKE scan. Is there something that can improve it?

Maybe you can advise how to solve this business-logic problem with a different approach? Basically, we log events for users, and for each event we want to attach an array of IDs of some entities. That array changes frequently during a user's lifecycle. We want to run queries filtering by one or many IDs, much like the example above.
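
For context, filtering by more than one ID with the current JSON schema looks something like this (just a sketch of the query shape we need; the second ID, 7, is an arbitrary example):

select count(*)
from test_events
where json_array_contains_double(segments, 42)
  and json_array_contains_double(segments, 7);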

Just in case, some tech specs: 3 identical bare-metal servers, 1 serving as the aggregator and 2 as data nodes. Each machine has NUMA, so there are 4 leaf nodes in total. Fast SSDs, 32 cores (2 x [email protected]), 32GB RAM.

Upvotes: 0

Views: 541

Answers (1)

Joseph Victor

Reputation: 829

I'm not surprised that this is slow. MemSQL uses a Parquet-based compression for columnar JSON, and we don't have these sorts of fast lookups quite yet (but stay tuned!).

There are a few options. One is, if you're always going to be searching for 42, to use a persisted column (https://docs.memsql.com/docs/persistent-computed-columns). This seems unlikely to be your use case, though.
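
A minimal sketch of what that could look like, assuming the persisted computed column syntax from the linked docs applies to your columnstore table (the has_42 column name is made up for illustration):

create table test_events_persisted (
  `timestamp` datetime not null,
  `user_id` int(20) not null,
  `segments` json collate utf8_bin not null,
  -- computed once at write time and stored, so reads avoid parsing JSON per row
  `has_42` as json_array_contains_double(segments, 42) persisted tinyint,
  key (`timestamp`) using clustered columnstore,
  shard key (`user_id`)
);

select count(*) from test_events_persisted where has_42 = 1;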

The other option is, if you are always searching within the same array, to create a normalized table (https://en.wikipedia.org/wiki/Database_normalization). Something like:

create table test_events_array (
  `timestamp` datetime not null,
  user_id bigint not null,
  segment bigint,
  shard key (user_id),
  key (`timestamp`) using clustered columnstore
);

Then select count(*) from test_events_array where segment = 42 will be lightning fast.

It'll also compress down to almost nothing with that schema, probably.
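
If you need to filter by several IDs at once with that normalized schema, one possible shape (a sketch, assuming a (timestamp, user_id) pair identifies an event and segment values aren't duplicated within one event) is a group-by with a having clause:

select count(*) from (
  select `timestamp`, user_id
  from test_events_array
  where segment in (42, 7)
  group by `timestamp`, user_id
  -- keep only events whose arrays contained all the requested IDs
  having count(*) = 2
) matching_events;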

Upvotes: 1
