Reputation: 1455
I am experiencing high latency between my Spark nodes and HBase nodes. My current resources require me to run HBase and Spark on separate servers.
The HFiles are compressed with the Snappy algorithm, which reduces the size of each region from 50 GB to 10 GB.
Nevertheless, the data transferred over the wire is always uncompressed, so reading takes a long time: approximately 20 MB per second, which is about 45 minutes per 50 GB region.
What can I do to make data reading faster? (Or, is the current throughput considered high for HBase?)
I was thinking of cloning the HBase HFiles locally onto the Spark machines instead of continuously requesting data from HBase. Is that possible?
What is the best practice for solving such an issue?
Thanks
Upvotes: 2
Views: 252
Reputation: 21
You are thinking in the right direction. You can copy the HFiles to the HDFS cluster (or machines) where Spark is running. That avoids decompressing the data before it crosses the wire and cuts down the amount transferred. You would then need to read the Snappy-compressed HFiles and write a parser for them.
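A common way to get this effect without writing your own HFile parser is to take an HBase snapshot, copy it to the Spark cluster's HDFS with the ExportSnapshot tool, and read it from Spark through TableSnapshotInputFormat, which scans the HFiles directly and bypasses the RegionServers. A minimal Scala sketch, assuming that setup (the snapshot name, table name and restore directory are placeholders, not anything from your environment):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

// Assumes the snapshot was created in the HBase shell (snapshot 'my_table', 'my_table_snapshot')
// and copied to the Spark-side HDFS with something like:
//   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
//     -snapshot my_table_snapshot -copy-to hdfs://spark-cluster/hbase

val spark = SparkSession.builder().appName("hbase-snapshot-read").getOrCreate()

val conf = HBaseConfiguration.create()
val job = Job.getInstance(conf)

// Point the input format at the snapshot; the restore dir only holds
// metadata and links, the HFile data itself is not duplicated
TableSnapshotInputFormat.setInput(job, "my_table_snapshot", new Path("/tmp/snapshot_restore"))

// Scan the snapshot's HFiles directly, without any RegionServer round trips
val rdd = spark.sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(rdd.count())
```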
Alternatively, if you don't need all the data from HBase, you can apply Column and ColumnFamily filters so that only the columns you actually read cross the wire.
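If the scan does have to go through the RegionServers, narrowing it is the cheapest win. A sketch of restricting a Spark read to one column via TableInputFormat (the table, family and qualifier names here are placeholders):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-filtered-scan").getOrCreate()

// Only the requested family/column is shipped back from the RegionServers
val scan = new Scan()
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"))
// or scan.addFamily(Bytes.toBytes("cf")) to take a whole column family

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val rdd = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(rdd.count())
```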
Upvotes: 1