Why column-based data bases are easier to compress?

Question

Row-based databases look something like:

001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;

Column-based databases look something like:

10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;

Why do services such as RedShift claim to be able to compress data better? Or more generally why do column-based databases seem to support compression? It appears both these formats could be compressed easily.

Iłya Bursov · Accepted Answer

almost all compression algorithms utilize the fact that some data is repeated and thus can be compressed:

in row based storage: every record consists of different data types, so duplication is rare

in column based storage: every record consists of values the same type and usually contains duplicates

for your sample data:

all rows in row-based database are different and cannot be de-duplicated

in columns - at least last names have Jones 2 times, so instead of storing it 2 times, compression algorithm can store it one time and provide some reference (which is smaller) when it occurs second time

Why column-based data bases are easier to compress?

Answers (1)

Related Questions