Reputation: 157
I am having trouble understanding how major compaction differs from minor compaction. As far as I know, a minor compaction merges some HFiles into one or a few HFiles.
And I think major compaction does almost the same thing, except that it also handles deleted rows.
So I have no idea why major compaction brings back the data locality of HBase (when it is used over HDFS).
In other words, why can't minor compaction restore data locality, when to me both minor and major compaction are just merging HFiles into a smaller number of HFiles?
And why does only major compaction dramatically improve read performance? I would think minor compaction also contributes to read performance.
Please help me to understand.
Thank you in advance.
Upvotes: 4
Views: 6968
Reputation: 1403
Before understanding the difference between major and minor compactions, you need to understand a factor that impacts performance from the point of view of compactions: data locality. When HBase runs over HDFS, a region server gets good locality when the HFiles it reads have a replica on its local data node. As you can imagine, the chances of having poor locality are higher for older data, due to restarts and region rebalances.
Now, an easy way to understand the difference between minor and major compactions is as follows:
Minor Compaction: This compaction type runs all the time and focusses mainly on newly written files. By virtue of being new, these files are small and can contain delete markers for data in older files. Since this compaction only looks at relatively newer files, it does not touch or delete data from older files. This means that until a different compaction type comes along and deletes the older data, a minor compaction cannot remove the delete markers even from the newer files; otherwise, the older deleted KeyValues would become visible again.
This leads to two outcomes:
As the files being touched are relatively newer and smaller, their impact on data locality is very low. In fact, during a write operation, a region server tries to write the primary replica of the data on the local HDFS data node anyway. So, a minor compaction usually does not add much value to data locality.
Since the delete markers are not removed, some performance is still left on the table. That said, minor compactions are critical for HBase read performance, as they keep the total file count under control; an unchecked file count can become a big performance bottleneck, especially on spinning disks.
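To make the delete-marker argument concrete, here is a toy Python model of the merge semantics. This is not the real HBase code or API; the file contents, sequence ids, and the `read`/`minor_compact` helpers are all invented for illustration:

```python
# Toy model of HBase KeyValues as tuples: (row, sequence_id, op, value).
# Sketch only -- real HFiles, MemStores, and compaction policies are far richer.

def read(files):
    """Merge-read across files: for each row, the cell with the highest
    sequence id wins; a 'delete' marker hides the row entirely."""
    latest = {}
    for f in files:
        for row, seq, op, value in f:
            if row not in latest or seq > latest[row][0]:
                latest[row] = (seq, op, value)
    return {row: v for row, (seq, op, v) in latest.items() if op == "put"}

def minor_compact(newer_files):
    """Merge only the newer files. Delete markers MUST be kept, because
    the puts they shadow may live in older files we are not touching."""
    merged = [kv for f in newer_files for kv in f]
    return sorted(merged, key=lambda kv: (kv[0], kv[1]))

older = [("a", 1, "put", "old-a")]          # untouched by minor compaction
newer = [[("a", 2, "delete", None)], [("b", 3, "put", "new-b")]]

compacted = minor_compact(newer)
print(read([older, compacted]))             # row 'a' stays hidden

# If the minor compaction (wrongly) dropped the delete marker,
# the deleted row 'a' would become visible again:
wrong = [kv for kv in compacted if kv[2] != "delete"]
print(read([older, wrong]))
```

Running this shows the marker doing its job: with the marker kept, only `b` is visible; with the marker dropped, the deleted `a` resurfaces from the older file.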
Major Compaction: This compaction type runs rarely (once a week by default) and focusses on the complete cleanup of a store (one column family inside one region). The output of a major compaction is a single file per store. Since a major compaction rewrites all the data inside a store, it can remove both the delete markers and the older KeyValues marked as deleted by those markers.
This also leads to two outcomes:
Since delete markers and deleted data are physically removed, file sizes are reduced dramatically, especially in a system receiving a lot of delete operations. This can lead to a dramatic increase in performance in a delete-heavy environment.
Since all the data of a store is rewritten, this is also the chance to restore data locality for the older (and larger) files, where the drift might have happened due to restarts and rebalances as explained earlier. This leads to better IO performance during reads.
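Continuing the same toy model, a major compaction sees every file of the store at once, so it can safely drop both the delete markers and the cells they shadow. Again, this is an illustrative sketch, not the real HBase implementation:

```python
# Toy model of KeyValues as (row, sequence_id, op, value) tuples.
# A major compaction rewrites ALL files of a store into one output file,
# so it can physically remove tombstones and the cells they hide.

def major_compact(all_files):
    latest = {}
    for f in all_files:
        for row, seq, op, value in f:
            if row not in latest or seq > latest[row][1]:
                latest[row] = (row, seq, op, value)
    # Keep only live puts: delete markers and deleted cells are dropped,
    # which is safe because no other file of this store survives.
    return sorted(kv for kv in latest.values() if kv[2] == "put")

older = [("a", 1, "put", "old-a")]
newer = [("a", 2, "delete", None), ("b", 3, "put", "new-b")]

store = major_compact([older, newer])
print(store)   # single file: tombstone for 'a' and the shadowed put are gone
```

Because the output is a brand-new file written by the region server, its primary replica lands on the local data node, which is exactly how the major compaction restores locality for the old data it rewrites.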
More on HBase compactions: HBase Book
Upvotes: 8