Alex

Reputation: 1

MemSQL search performance: JSON vs varchar

I'm running some MemSQL performance tests on sample data and am seeing very poor behavior while querying JSON data. I have two tables that look very similar and contain exactly the same information (loaded from the same CSV file). The only difference is that the segments column is JSON vs varchar(255).

CREATE TABLE `test_events` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` JSON COLLATE utf8_bin NOT NULL,
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
)

CREATE TABLE `test_events_string` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
)

And here is an example row (the number of items in the array varies from 1 to 20):

memsql> select * from test_events limit 1;
+---------------------+---------+------------------------+
| timestamp           | user_id | segments               |
+---------------------+---------+------------------------+
| 2017-01-04 00:00:00 |   26834 | [19,18,9,6,7,22,34,43] |
+---------------------+---------+------------------------+

Below are two queries that fetch the same information, but their speed is very different. Each query was executed twice, and I'm copying the second run:

memsql> select count(*) from test_events where json_array_contains_double(segments, 42);
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (15.86 sec)

memsql> select count(*) from test_events_string where segments like '%42%';
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (1.96 sec)

memsql> select count(*) from test_events;
+-----------+
| count(*)  |
+-----------+
| 306939340 |
+-----------+
1 row in set (0.02 sec)

So the JSON scan is 8 times slower than a %x% LIKE scan. Is there something that can improve it?

Maybe you can advise how to solve this business-logic problem with a different approach? Basically, we log events for users, and for each event we want to attach an array of IDs of some entities. That array changes frequently during a user's lifecycle. We want to run queries filtering by one or many IDs, much like the example above.
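
For context, filtering by more than one ID with the current JSON schema looks something like this (just a sketch of the query shape we need; the second ID, 7, is an arbitrary example):

select count(*)
from test_events
where json_array_contains_double(segments, 42)
  and json_array_contains_double(segments, 7);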

Just in case, some tech specs: 3 identical bare-metal servers, 1 serving as the aggregator and 2 as data nodes. Each machine has NUMA, so there are 4 leaf nodes in total. Fast SSDs, 32 cores (2 x [email protected]), 32GB RAM.

Upvotes: 0

Views: 541

Answers (1)

Joseph Victor

Reputation: 829

I'm not surprised that this is slow. MemSQL uses a Parquet-based compression for columnar JSON, and we don't have these sorts of fast lookups quite yet (but stay tuned!).

There are a few options. One is, if you're always going to be searching for 42, to use a persisted column (https://docs.memsql.com/docs/persistent-computed-columns). This seems unlikely to be your use case, though.
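
A minimal sketch of what that could look like, assuming the persisted computed column syntax from the linked docs applies to your columnstore table (the has_42 column name is made up for illustration):

create table test_events_persisted (
  `timestamp` datetime not null,
  `user_id` int(20) not null,
  `segments` json collate utf8_bin not null,
  -- computed once at write time and stored, so reads avoid parsing JSON per row
  `has_42` as json_array_contains_double(segments, 42) persisted tinyint,
  key (`timestamp`) using clustered columnstore,
  shard key (`user_id`)
);

select count(*) from test_events_persisted where has_42 = 1;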

The other option is, if you are always searching within the same array, to create a normalized table (https://en.wikipedia.org/wiki/Database_normalization). Something like:

create table test_events_array (
  `timestamp` datetime not null,
  user_id bigint not null,
  segment bigint,
  shard key (user_id),
  key (`timestamp`) using clustered columnstore
);

Then select count(*) from test_events_array where segment = 42 will be lightning fast.

It'll also compress down to almost nothing with that schema, probably.
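
If you need to filter by several IDs at once with that normalized schema, one possible shape (a sketch, assuming a (timestamp, user_id) pair identifies an event and segment values aren't duplicated within one event) is a group-by with a having clause:

select count(*) from (
  select `timestamp`, user_id
  from test_events_array
  where segment in (42, 7)
  group by `timestamp`, user_id
  -- keep only events whose arrays contained all the requested IDs
  having count(*) = 2
) matching_events;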

Upvotes: 1
