Reputation: 63
I am student learning Google BigTable design.
I am confused, the SST table is sorted internally. Two SST tables may not be sorted. In this case, it seems BigTable doesn't support efficient range scan for primary key? For example, "select * where id between 100 and 200". BigTable may need to scan all the SST to get the result.
Then my understanding for why SST is sorted is because for single primary key query, we can do binary search within a SST.
Another question I have is, does MemTable sorted? If yes, how? Because MemTable need to update frequently. If use data structure like tree, then we need to traverse the tree when we write MemTable into SST?
Upvotes: 1
Views: 940
Reputation: 582
It sounds like you've at least been through an overview of the original Bigtable paper, but here's a reference if you haven't read the whole thing; your questions can mostly be answered by a closer read: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Your intuitions about Bigtable are spot on. Both the SStables on disk and the Memtable are sorted based on the primary key, and any read (not just a scan) requires consulting all of them to produce a merged view. However, note that they are all sorted on the same key, so this amounts to a parallel traversal. We seek to the beginning of the range to be read in each sstable and in the memtable, and traverse them in parallel from there.
This process is alluded to in section 5.3: "A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently."
Some of these lookups can be mitigated using Bloom Filters as described in section 6 of the paper, but not all.
Of course, if there are too many sstables in a tablet, this will still become inefficient. Section 5.4 of the paper goes into a bit more detail about how this is solved, which is to periodically merge the "stack" of sstables for a tablet into a smaller number of new sstables. If a particular tablet stops receiving writes, this will eventually quiesce its state down to a single file.
Regarding the efficiency of the memtable, the paper does not prescribe a particular data structure. However, suffice it to say that there are many examples of efficient in-memory sorted data structures. In addition, Section 5.4 mentions that there can actually be multiple memtables in a given tablet. By the time we scan a memtable to write it out to disk, we have already installed a new memtable at the top of the tablet "stack" and are serving the latest incoming reads from there.
Upvotes: 2