Reputation:
Row-based databases look something like:
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;
Column-based databases look something like:
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
Why do services such as RedShift claim to be able to compress data better? Or more generally why do column-based databases seem to support compression? It appears both these formats could be compressed easily.
Upvotes: 0
Views: 97
Reputation: 24146
almost all compression algorithms utilize the fact that some data is repeated and thus can be compressed:
in row based storage: every record consists of different data types, so duplication is rare
in column based storage: every record consists of values the same type and usually contains duplicates
for your sample data:
all rows in row-based database are different and cannot be de-duplicated
in columns - at least last names have Jones
2 times, so instead of storing it 2 times, compression algorithm can store it one time and provide some reference (which is smaller) when it occurs second time
Upvotes: 2