konrad_sx

Reputation: 485

Junction table vs. many columns vs. arrays in PostgreSQL: Memory and performance

I'm building a Postgres database for a product search (up to 3 million products) with large groups of similar data for each product, e.g. prices for different countries and country-specific average ratings, for up to 170 countries.

The natural solution would seem to be arrays (e.g. a real[] column for the prices and another for the ratings). However, the data needs to be indexed individually per country for sorting and range queries (the data for different countries is not reliably correlated). So, based on this discussion, I think it would be better to use an individual column for each country.
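
For illustration, the array variant I have in mind would look roughly like this (a sketch; table and column names are mine, nothing is fixed yet):

CREATE TABLE products_arr (
  product_id int PRIMARY KEY
, prices  real[]   -- element k = price for country k
, ratings real[]   -- element k = rating for country k
);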

There are about 8 country-specific properties of which maybe 4 need to be indexed, so I may end up with more than 1300 columns and 650 indexes. Might that be a problem? Is there a better solution?


EDIT, after everyone has been telling me about many-to-many relationships, normalization and so on:

I am not convinced. If I understand correctly, this always comes down to a junction table (known under many names), as in Erwin Brandstetter's answer.

As I mentioned in my first comment, this would be a great solution if each product had prices and ratings for only a few countries. If that is not the case, however, a junction table can require significantly more memory (consider the endlessly repeated product-id and country-id, and, more seriously still, the row overhead of a narrow table with hundreds of millions of rows).

Here is a Python script to demonstrate this. It creates a junction table product_country for the prices and ratings of products in different countries, and a "many-column table" products for the same data. The tables are populated with random values for 100,000 products and 100 countries.

For simplicity, I use ints to identify products and countries, and for the junction-table approach I only build the junction table itself.

import psycopg2
from psycopg2.extras import execute_values
from random import random
from time import time

cn = psycopg2.connect(...)  # connection parameters omitted
cn.autocommit = True
cr = cn.cursor()

num_countries = 100
num_products = 100000


def junction_table():
    print("JUNCTION TABLE")

    cr.execute("CREATE TABLE product_country (product_id int, country_id int, "
               "price real, rating real, PRIMARY KEY (product_id, country_id))")

    t = time()
    for p in range(num_products):
        # use batch insert; without it, this would be about 10 times slower
        execute_values(cr, "INSERT INTO product_country "
                           "(product_id, country_id, price, rating) VALUES %s",
                       [[p, c, random() * 100, random() * 5]
                        for c in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")

    t = time()
    cr.execute("CREATE INDEX i_price ON product_country (country_id, price)")
    cr.execute("CREATE INDEX i_rating ON product_country (country_id, rating)")
    print(f"Creating indexes took {int(time() - t)}s")

    sizes('product_country')


def many_column_table():
    print("\nMANY-COLUMN TABLE")

    cr.execute("CREATE TABLE products (product_id int PRIMARY KEY, "
               + ', '.join([f'price_{i} real' for i in range(num_countries)]) + ', '
               + ', '.join([f'rating_{i} real' for i in range(num_countries)]) + ')')

    t = time()
    for p in range(num_products):
        cr.execute("INSERT INTO products (product_id, "
                   + ", ".join([f'price_{i}' for i in range(num_countries)]) + ', '
                   + ", ".join([f'rating_{i}' for i in range(num_countries)]) + ') '
                   + "VALUES (" + ",".join(["%s"] * (1 + 2 * num_countries)) + ') ',
                   [p] + [random() * 100 for _ in range(num_countries)]
                   + [random() * 5 for _ in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")

    t = time()
    for i in range(num_countries):
        cr.execute(f"CREATE INDEX i_price_{i} ON products (price_{i})")
        cr.execute(f"CREATE INDEX i_rating_{i} ON products (rating_{i})")
    print(f"Creating indexes took {int(time() - t)}s")

    sizes('products')


def sizes(table_name):
    cr.execute(f"SELECT pg_size_pretty(pg_relation_size('{table_name}'))")
    print("Table size: " + cr.fetchone()[0])
    cr.execute(f"SELECT pg_size_pretty(pg_indexes_size('{table_name}'))")
    print("Indexes size: " + cr.fetchone()[0])


if __name__ == '__main__':
    junction_table()
    many_column_table()

Output:

JUNCTION TABLE
Insert data took 179s
Creating indexes took 28s
Table size: 422 MB
Indexes size: 642 MB

MANY-COLUMN TABLE
Insert data took 138s
Creating indexes took 31s
Table size: 87 MB
Indexes size: 433 MB

Most importantly, the total size (table + indexes) of the junction table is about twice that of the many-column table, and the table size alone is nearly 5 times larger.

This is easily explained by the per-row overhead and by the product-id and country-id repeated in every row (10,000,000 rows vs. just 100,000 rows in the many-column table).
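
A back-of-the-envelope check (assuming the standard heap tuple layout: a 23-byte header rounded up to 24 by alignment, plus a 4-byte line pointer per row; page headers and fill factor ignored):

junction table:    24 B header + 2 ints + 2 reals (16 B) + 4 B pointer ≈ 44 B/row
                   10,000,000 rows × 44 B ≈ 420 MB   (measured: 422 MB)
many-column table: 24 B header + 1 int + 200 reals (804 B, tuple aligned to 832 B) + 4 B pointer ≈ 836 B/row
                   100,000 rows × 836 B ≈ 80 MB   (measured: 87 MB)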

The sizes scale approximately linearly with the number of products (I tested with 700,000 products), so for 3 million products the junction table would be about 32 GB (12.7 GB table + 19.2 GB indexes), while the many-column table would be just 15.6 GB (2.6 GB table + 13 GB indexes), which is decisive if everything is to be cached in RAM.

Query times are about the same when everything is cached; here is a somewhat typical example for 700,000 products:

EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price, rating FROM product_country
WHERE country_id=7 and price < 10
ORDER BY rating DESC LIMIT 200

-- Limit  (cost=0.57..1057.93 rows=200 width=12) (actual time=0.037..2.250 rows=200 loops=1)
--   Buffers: shared hit=2087
--   ->  Index Scan Backward using i_rating on product_country  (cost=0.57..394101.22 rows=74544 width=12) (actual time=0.036..2.229 rows=200 loops=1)
--         Index Cond: (country_id = 7)
--         Filter: (price < '10'::double precision)
--         Rows Removed by Filter: 1871
--         Buffers: shared hit=2087
-- Planning Time: 0.111 ms
-- Execution Time: 2.364 ms

EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price_7, rating_7 FROM products
WHERE price_7 < 10
ORDER BY rating_7 DESC LIMIT 200

-- Limit  (cost=0.42..256.82 rows=200 width=12) (actual time=0.023..2.007 rows=200 loops=1)
--   Buffers: shared hit=1949
--   ->  Index Scan Backward using i_rating_7 on products  (cost=0.42..91950.43 rows=71726 width=12) (actual time=0.022..1.986 rows=200 loops=1)
--         Filter: (price_7 < '10'::double precision)
--         Rows Removed by Filter: 1736
--         Buffers: shared hit=1949
-- Planning Time: 0.672 ms
-- Execution Time: 2.265 ms

Regarding flexibility, data integrity etc., I see no serious problem with the many-column approach: I can easily add and delete columns for countries, and with a sensible naming scheme for the columns it should be easy to avoid mistakes.
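
For example, adding a new country is just (country number 170 as a hypothetical example):

ALTER TABLE products ADD COLUMN price_170 real, ADD COLUMN rating_170 real;
CREATE INDEX i_price_170 ON products (price_170);
CREATE INDEX i_rating_170 ON products (rating_170);

-- removing a country; its indexes are dropped automatically
ALTER TABLE products DROP COLUMN price_13, DROP COLUMN rating_13;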

So I think I have every reason not to use a junction table.

Further, with arrays everything would be clearer and simpler than with many columns, and if there were a way to easily define individual indexes on the array elements, that would be the best solution (maybe the total index size could even be reduced?).
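
For what it's worth, PostgreSQL does allow expression indexes on individual array elements, which might make the array variant from the sketch above workable; whether that actually saves space compared to plain columns would have to be measured:

-- one expression index per element, analogous to i_price_7 / i_rating_7
CREATE INDEX i_arr_price_7 ON products_arr ((prices[7]));
CREATE INDEX i_arr_rating_7 ON products_arr ((ratings[7]));

-- the planner only considers these when the query uses the exact same expression
SELECT product_id, prices[7], ratings[7] FROM products_arr
WHERE prices[7] < 10
ORDER BY ratings[7] DESC LIMIT 200;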

So I think my original question is still valid. However, there is of course much more to consider and to test. Also, I'm by no means a database expert, so tell me if I'm wrong.

Here are the test tables from the script for 5 products and 3 countries: product_country, products

Upvotes: -3

Views: 1642

Answers (1)

Erwin Brandstetter

Reputation: 658947

The "natural" solution for a relational database is to create additional tables in one-to-many or many-to-many relationships. Look into database normalization.

Basic m:n design for product ratings per country:

CREATE TABLE country (
  country_id varchar(2) PRIMARY KEY
, country text UNIQUE NOT NULL
);

CREATE TABLE product (
  product_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, product text NOT NULL
  -- more?
);

CREATE TABLE product_ratings (
  product_id int REFERENCES product
, country_id varchar(2) REFERENCES country
, rating1 real
, rating2 real
  -- more?
, PRIMARY KEY (product_id, country_id)   
);

fiddle
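
Typical per-country sort and range queries are then served by a plain multicolumn index. A sketch (the index name is illustrative, and rating1 stands in for whichever rating is queried):

CREATE INDEX product_ratings_country_rating1_idx ON product_ratings (country_id, rating1);

SELECT product_id, rating1
FROM   product_ratings
WHERE  country_id = 'us'
ORDER  BY rating1 DESC
LIMIT  200;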


Upvotes: 0
