Tomáš Repík
Tomáš Repík

Reputation: 155

How to improve query performance on a single node in MonetDB?

I've installed the latest (MonetDB 5 server v11v.27.5 "Jul2017-SP1") on Windows 2012 Server and I'm trying to query large table 1,4 billion rows in a reasonable time 2-3s.

Is this even possible with MonetDB? What could I do to improve the performance?

Detailed description of what I've done so far:

  1. Created table:

    CREATE TABLE t939ba ( id INT, xa INT, xb INT, ya INT, yb INT, a1 TINYINT, a2 TINYINT, a3 TINYINT, a4 TINYINT, a5 TINYINT, a6 TINYINT, a7 TINYINT, a8 TINYINT, a9 TINYINT);
    
  2. Loaded the data:

    COPY 1450000000 OFFSET 2 RECORDS INTO tbl FROM 'D:\\es_export\\file.csv'
    USING DELIMITERS ',' NULL AS '' LOCKED;
    
  3. Run the query:

    SELECT COUNT(DISTINCT id) FROM tbl WHERE a1=22
    AND xb>=143455 AND yb>=90911 AND xa<=143615 AND ya<=91007
    AND a2 IN (2, 3, 4) AND a3 IN (0, 1, 2, 3, 4) AND a4 IN (0, 1, 2)
    AND a5 IN (-1, 1, 2, 3, 4, 5, 6, 7) AND a6 IN (-1, 11, 12, 13, 14);
    

When I run the query for the 1st time it took (14m 52s), 2nd run of the same query took (3m 23s), 3rd consecutive run of the same query took (14s) and a slightly rearranged query took (3m 11s).

Upvotes: 3

Views: 602

Answers (2)

Stefan Manegold
Stefan Manegold

Reputation: 36

Tomas,

thanks for the plans and traces. I see that you used the revised query with range predicates instead of IN predicates, and that this query now runs in "a mere" ~39 s (compared to ~15 min) --- either because the range predicates are evaluated more efficiently than the IN predicates, or because, as Martin indicated, a later run of the query benefits from indexes that where automatically built by MonetDB when evaluating the first query, or because of both.

In any case, running a/each query (-version) more than once is a good idea to see the possible effect of automatically built indexes.

Further, I see that either you indeed have a 34-core machine, or your machine has "only" 2 GB RAM per core --- not too much given that you have a ~42 GB data set, where each column is ~1.5 GB to ~6 GB in size ...

Hence, the main reason for the query not running faster than ~39 s might be I/O activity due to "lack" of memory.

Best,

Stefan

ps:
You can check whether for this specific query, reducing (or even avoiding) multi-core parallelism helps to reduce I/O thrashing:
Try running your query after disabling MonetDB's "mitosis" optimizer using

set optimizer='no_mitosis_pipe';

You can re-enable full multi-core parallelism using

set optimzer='default_pipe';

Best,
Stefan

Upvotes: 1

Stefan Manegold
Stefan Manegold

Reputation: 36

Tomáš,

in general, I think this should be possible. Why is is not in your case is hard to says given the little information. Could you possibly share (some of) the following information:

  • hardware characteristics: CPU (type, #cores, clockspeed), amount of RAM, type of I/O system (single HDD, HDD RAID, SSD, NVMe, ...)

Further, to understand where time goes, one would need to know the query plan MonetDB generates and uses, and profile the query.

Could you produce the PLAN (logical plan), EXPLAIN (physical plan), and TRACE (execution time profiling) of your query (see https://www.monetdb.org/Documentation/Manuals/SQLreference/Runtime for details), and share them (if not here then via email)?

Could you try to run on a non-Windows (preferable Linux) system? We don't have the best experiences regarding performance with Windows ...

Thanks!

Stefan

ps:

You can also try to revise your query slightly as follows, and see whether this helps:

SELECT COUNT(DISTINCT id) FROM tbl WHERE
            a1=22
        AND xb>=143455
        AND yb>=90911
        AND xa<=143615
        AND ya<=91007
        AND a2 between 2 and 4
        AND a3 between 0 and 4
        AND a4 between 0 and 2
        AND a5 IN (-1, 1, 2, 3, 4, 5, 6, 7)
        AND a6 IN (-1, 11, 12, 13, 14)
;

or even

SELECT COUNT(DISTINCT id) FROM tbl WHERE
            a1=22
        AND xb>=143455
        AND yb>=90911
        AND xa<=143615
        AND ya<=91007
        AND a2 between 2 and 4
        AND a3 between 0 and 4
        AND a4 between 0 and 2
        AND (a5 = -1 or a5 between 1 and 7)
        AND (a6 = -1 or a6 between 11 and 14)
;

Also, could you possibly check and share the following statistics of your data:

select
        count(*),
        count(id), count(distinct id),
        count(xa), count(distinct xa),
        count(xb), count(distinct xb),
        count(ya), count(distinct ya),
        count(yb), count(distinct yb),
        count(a1), count(distinct a1),
        count(a2), count(distinct a2),
        count(a3), count(distinct a3),
        count(a4), count(distinct a4),
        count(a5), count(distinct a5),
        count(a6), count(distinct a6),
        count(a7), count(distinct a7),
        count(a8), count(distinct a8),
        count(a9), count(distinct a9)
from tbl
;

select count(*) from tbl where a1=22;
select count(*) from tbl where xb>=143455;
select count(*) from tbl where yb>=90911;
select count(*) from tbl where xa<=143615;
select count(*) from tbl where ya<=91007;
select count(*) from tbl where a2 IN (2, 3, 4);
select count(*) from tbl where a3 IN (0, 1, 2, 3, 4);
select count(*) from tbl where a4 IN (0, 1, 2);
select count(*) from tbl where a5 IN (-1, 1, 2, 3, 4, 5, 6, 7);
select count(*) from tbl where a6 IN (-1, 11, 12, 13, 14);

Upvotes: 0

Related Questions