Reputation: 4103
I have a table targeting
that has a column marital_status
of type text[]
and another column data
of type jsonb
. The content of these two columns is the same, just in a different format (it's just for demonstration purposes). Example data:
id | marital_status | data
----+--------------------------+---------------------------------------------------
1 | null | {}
2 | {widowed} | {"marital_status": ["widowed"]}
3 | {never_married,divorced} | {"marital_status": ["never_married", "divorced"]}
...
There are more than 690K records in the table in random combination.
EXPLAIN ANALYZE SELECT marital_status
FROM targeting
WHERE marital_status @> '{widowed}'::text[]
Usually takes < 900ms without creating any indices:
Seq Scan on targeting (cost=0.00..172981.38 rows=159061 width=28) (actual time=0.017..840.084 rows=158877 loops=1)
Filter: (marital_status @> '{widowed}'::text[])
Rows Removed by Filter: 452033
Planning time: 0.150 ms
Execution time: 845.731 ms
With index it usually takes < 200ms (75% improvement):
CREATE INDEX targeting_marital_status_idx ON targeting ("marital_status");
Result:
Index Only Scan using targeting_marital_status_idx on targeting (cost=0.42..23931.35 rows=159061 width=28) (actual time=3.528..143.848 rows=158877 loops=1)"
Filter: (marital_status @> '{widowed}'::text[])
Rows Removed by Filter: 452033
Heap Fetches: 0
Planning time: 0.217 ms
Execution time: 148.506 ms
EXPLAIN ANALYZE SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb
Usually takes < 5,700ms without creating any indices (more than 6 times slower!):
Seq Scan on targeting (cost=0.00..174508.65 rows=611 width=403) (actual time=0.095..5399.112 rows=158877 loops=1)
Filter: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Rows Removed by Filter: 452033
Planning time: 0.172 ms
Execution time: 5408.326 ms
With index it usually takers < 3,700ms (35% improvement):
CREATE INDEX targeting_data_marital_status_idx ON targeting USING GIN ((data->'marital_status'));
Result:
Bitmap Heap Scan on targeting (cost=144.73..2482.75 rows=611 width=403) (actual time=85.966..3694.834 rows=158877 loops=1)
Recheck Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Rows Removed by Index Recheck: 201080
Heap Blocks: exact=33723 lossy=53028
-> Bitmap Index Scan on targeting_data_marital_status_idx (cost=0.00..144.58 rows=611 width=0) (actual time=78.851..78.851 rows=158877 loops=1)"
Index Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Planning time: 0.257 ms
Execution time: 3703.492 ms
text[]
column so much more performant, even when not using indices?jsonb
column only increase the performance by 35%?jsonb
column?Upvotes: 6
Views: 6589
Reputation: 1
Seems like an easy question. Essentially you're asking how come,
CREATE TABLE foo ( id int, key1 text );
Is faster than
CREATE TABLE bar ( id int, jsonb foo );
@Craig answers it here in the comment
GIN indexing is generally less efficient than a b-tree, so that much is expected.
Also the null value in that schema should read
SELECT jsonb_build_object('marital_status',ARRAY[null]);
jsonb_build_object
----------------------------
{"marital_status": [null]}
(1 row)
And not {}
. PostgreSQL takes numerous shortcuts to make updates to the jsonb objects quick, and to make indexing space-efficient.
If none of this make sense look at this pseudo-table.
CREATE TABLE foo ( id int, x text, y text, z text )
CREATE INDEX ON foo(x);
CREATE INDEX ON foo(y);
CREATE INDEX ON foo(z);
Here we have three btree indexes on the table. Let's look at a similar table..
CREATE TABLE bar ( id int, junk jsonb );
CREATE INDEX ON bar USING gin (junk);
INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42}$$);
For bar
to be perform like foo
, we'd need two btrees both of which would be individually larger than the single GIN index we have. And if you did
INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42, "z":3}$$);
We'd have to have another btree index on z
, which again would be massive. You can see where I am going with this. jsonb
is great, but the complexity of indexing and schema modeling doesn't parallel the database. You can't just reduce the database to a jsonb column, issue CREATE INDEX
and expect the same performance.
Upvotes: 2
Reputation: 341
This might be an issue of using jsonb_ops
(the default GIN index strategy) instead of jsonb_path_ops
.
According to the docs: https://www.postgresql.org/docs/9.6/static/datatype-json.html
Although the
jsonb_path_ops
operator class supports only queries with the@>
operator, it has notable performance advantages over the default operator classjsonb_ops
. Ajsonb_path_ops
index is usually much smaller than ajsonb_ops
index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.The technical difference between a
jsonb_ops
and ajsonb_path_ops
GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data. [1] Basically, each jsonb_path_ops index item is a hash of the value and the key(s) leading to it; for example to index{"foo": {"bar": "baz"}}
, a single index item would be created incorporating all three of foo, bar, and baz into the hash value. Thus a containment query looking for this structure would result in an extremely specific index search; but there is no way at all to find out whether foo appears as a key. On the other hand, ajsonb_ops
index would create three index items representing foo, bar, and baz separately; then to do the containment query, it would look for rows containing all three of these items. While GIN indexes can perform such an AND search fairly efficiently, it will still be less specific and slower than the equivalentjsonb_path_ops
search, especially if there are a very large number of rows containing any single one of the three index items.
Upvotes: 0