AME
AME

Reputation: 2609

Do I have to create a surrogate key if I want to save space?

Let's say I have a very large table with owners of cars like so:

OWNERSHIP
owner    | car
---------------
steven   | audi
bernahrd | vw
dieter   | vw
eike     | vw
robert   | audi
... one hundred million rows ...

If I refactor it to this:

OWNERSHIP
owner    | car <-foreign key TYPE.car_type
---------------
steven   | audi
bernahrd | vw
dieter   | vw
eike     | vw
robert   | audi
...


TYPE
car_type      |
---------------
audi
vw

Do I win anything spacewise or speedwise or do I need to create an INTEGER surrogate key on car_type for that?

Upvotes: 1

Views: 904

Answers (2)

Daniel Lyons
Daniel Lyons

Reputation: 22803

The integer is going to take up 4 bytes, which is one more byte than "vw" will. As it happens, PostgreSQL enums take up 4 bytes too, so you won't gain anything storage-wise by switching to this representation (except for the difficulties it imposes on changing the enum itself). Querying will be as fast either way, because with a table that size you're going to be consulting the index anyway. Database performance, especially when tables get large, is essentially a matter of I/O, not CPU performance. I'm not convinced that an index on integers is going to be smaller or faster than an index on short strings, especially when you have a huge number of rows referencing a very small set of possible values. It's certainly not going to be the bottleneck in your applications.

Even if we assume that you were able to recover 4 bytes by using an artificial key, how much storage are you going to save? 4 bytes times 100 million rows would be about 400 MB ideally. Are you so pressed for storage that you need to eek out a small amount like that, on your honkin' database server? And this is assuming you refactor it into its own table and use a proper foreign key.

The right way to answer this, of course, is not to argue from first principles at all. Take your 100 million row table and work it both ways. Then examine the size yourself, like so:

SELECT pg_size_pretty(pg_total_relation_size('ownership')));
SELECT pg_size_pretty(pg_total_relation_size('ownership2')));

Do your test queries, with EXPLAIN ANALYZE like so:

EXPLAIN ANALYZE SELECT * FROM ownership WHERE car = 'audi';
EXPLAIN ANALYZE SELECT * FROM ownership2 WHERE car_id = 1;

Pay more attention to the actual time taken than the cost, but do look at the cost. Do this on the same database server as your production, if possible; if not, a similar machine with the same PostgreSQL configuration. Then you'll have hard numbers to tell you what you're paying for and what you're getting. My suspicion is that you'll find the space usage to be slightly worse with the artificial key and the performance to be equivalent.

If that's what you find, do the relational thing and use the natural key, and stop worrying so much about optimizing the physical storage. Space is the cheapest commodity you have.

Upvotes: 4

Tometzky
Tometzky

Reputation: 23890

Using two tables and string foreign key would of course use more space than using one. How much more depends on how many types of cars you have.

You should use integer car_id:

  • Using integer keys would save space if significant percentage of car names would repeat.

  • More so if you'd need to index car column, as integer index is much smaller than string index.

  • Also comparing integers is faster than comparing strings, so searching by car should also be faster.

  • Smaller table means that bigger part if it would fit in cache, so accessing it should also be faster.

Upvotes: 1

Related Questions