satoshi
satoshi

Reputation: 4103

Why are Postgres lookups on jsonb columns so slow?

I have a table targeting that has a column marital_status of type text[] and another column data of type jsonb. The content of these two columns is the same, just in a different format (it's just for demonstration purposes). Example data:

 id |      marital_status      |                        data                       
----+--------------------------+---------------------------------------------------
  1 | null                     | {}
  2 | {widowed}                | {"marital_status": ["widowed"]}
  3 | {never_married,divorced} | {"marital_status": ["never_married", "divorced"]}
...

There are more than 690K records in the table in random combination.

Lookup on text[] column

EXPLAIN ANALYZE SELECT marital_status
FROM targeting
WHERE marital_status @> '{widowed}'::text[]

Without index

Usually takes < 900ms without creating any indices:

Seq Scan on targeting  (cost=0.00..172981.38 rows=159061 width=28) (actual time=0.017..840.084 rows=158877 loops=1)
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
Planning time: 0.150 ms
Execution time: 845.731 ms

With index

With index it usually takes < 200ms (75% improvement):

CREATE INDEX targeting_marital_status_idx ON targeting ("marital_status");

Result:

Index Only Scan using targeting_marital_status_idx on targeting  (cost=0.42..23931.35 rows=159061 width=28) (actual time=3.528..143.848 rows=158877 loops=1)"
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
  Heap Fetches: 0
Planning time: 0.217 ms
Execution time: 148.506 ms

Lookup on jsonb column

EXPLAIN ANALYZE SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb

Without index

Usually takes < 5,700ms without creating any indices (more than 6 times slower!):

Seq Scan on targeting  (cost=0.00..174508.65 rows=611 width=403) (actual time=0.095..5399.112 rows=158877 loops=1)
  Filter: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Filter: 452033
Planning time: 0.172 ms
Execution time: 5408.326 ms

With index

With index it usually takers < 3,700ms (35% improvement):

CREATE INDEX targeting_data_marital_status_idx ON targeting USING GIN ((data->'marital_status'));

Result:

Bitmap Heap Scan on targeting  (cost=144.73..2482.75 rows=611 width=403) (actual time=85.966..3694.834 rows=158877 loops=1)
  Recheck Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Index Recheck: 201080
  Heap Blocks: exact=33723 lossy=53028
  ->  Bitmap Index Scan on targeting_data_marital_status_idx  (cost=0.00..144.58 rows=611 width=0) (actual time=78.851..78.851 rows=158877 loops=1)"
        Index Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Planning time: 0.257 ms
Execution time: 3703.492 ms

Questions

Upvotes: 6

Views: 6589

Answers (2)

Evan Carroll
Evan Carroll

Reputation: 1

Seems like an easy question. Essentially you're asking how come,

CREATE TABLE foo ( id int, key1 text );

Is faster than

CREATE TABLE bar ( id int, jsonb foo );

@Craig answers it here in the comment

GIN indexing is generally less efficient than a b-tree, so that much is expected.

Also the null value in that schema should read

SELECT jsonb_build_object('marital_status',ARRAY[null]);
     jsonb_build_object     
----------------------------
 {"marital_status": [null]}
(1 row)

And not {}. PostgreSQL takes numerous shortcuts to make updates to the jsonb objects quick, and to make indexing space-efficient.

If none of this make sense look at this pseudo-table.

CREATE TABLE foo ( id int, x text, y text, z text )
CREATE INDEX ON foo(x);
CREATE INDEX ON foo(y);
CREATE INDEX ON foo(z);

Here we have three btree indexes on the table. Let's look at a similar table..

CREATE TABLE bar ( id int, junk jsonb );
CREATE INDEX ON bar USING gin (junk);
INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42}$$);

For bar to be perform like foo, we'd need two btrees both of which would be individually larger than the single GIN index we have. And if you did

INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42, "z":3}$$);

We'd have to have another btree index on z, which again would be massive. You can see where I am going with this. jsonb is great, but the complexity of indexing and schema modeling doesn't parallel the database. You can't just reduce the database to a jsonb column, issue CREATE INDEX and expect the same performance.

Upvotes: 2

drs
drs

Reputation: 341

This might be an issue of using jsonb_ops (the default GIN index strategy) instead of jsonb_path_ops.

According to the docs: https://www.postgresql.org/docs/9.6/static/datatype-json.html

Although the jsonb_path_ops operator class supports only queries with the @> operator, it has notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.

The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data. [1] Basically, each jsonb_path_ops index item is a hash of the value and the key(s) leading to it; for example to index {"foo": {"bar": "baz"}}, a single index item would be created incorporating all three of foo, bar, and baz into the hash value. Thus a containment query looking for this structure would result in an extremely specific index search; but there is no way at all to find out whether foo appears as a key. On the other hand, a jsonb_ops index would create three index items representing foo, bar, and baz separately; then to do the containment query, it would look for rows containing all three of these items. While GIN indexes can perform such an AND search fairly efficiently, it will still be less specific and slower than the equivalent jsonb_path_ops search, especially if there are a very large number of rows containing any single one of the three index items.

Upvotes: 0

Related Questions