imriqwe

Reputation: 1455

Reducing the latency between Spark and HBase nodes

I am experiencing a high latency between Spark nodes and HBase nodes. The current resources I have require me to run HBase and Spark on different servers.

The HFiles are compressed with Snappy algorithm, which reduces the data size of each region from 50GB to 10GB.

Nevertheless, the data transferred over the wire is always decompressed, so reading takes a lot of time: approximately 20 MB per second, which is about 45 minutes for each 50GB region.
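As a sanity check on those numbers (using the figures from the question, with 1 GB = 1024 MB):

```python
# Back-of-the-envelope check of the observed read time for one region.
region_gb = 50        # decompressed region size from the question
throughput_mb_s = 20  # observed wire throughput from the question

seconds = region_gb * 1024 / throughput_mb_s  # 2560 s
minutes = seconds / 60
print(round(minutes, 1))  # ~42.7 min, consistent with "about 45 minutes"
```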

What can I do to make data reading faster? (Or, is the current throughput considered high for HBase?)

I was thinking of cloning the HBase HFiles locally to the Spark machines, instead of continuously requesting data from HBase. Is that possible?

What is the best practice for solving such an issue?

Thanks

Upvotes: 2

Views: 252

Answers (1)

Prashant Mittal

Reputation: 21

You are thinking in the right direction. You can copy the HFiles to the HDFS cluster (or machines) where Spark is running. That would save the decompression cost and reduce the data transferred over the wire. You would need to read the Snappy-compressed HFiles and write a parser to read them.
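A rough illustration of the saving, assuming the network (not disk or CPU) is the bottleneck at the same ~20 MB/s: copying the 10GB of Snappy-compressed HFiles instead of streaming 50GB of decompressed rows cuts wire time by the 5:1 compression ratio, with decompression then happening locally on the Spark machines.

```python
# Compare wire time: decompressed reads via region servers vs. copying
# the Snappy-compressed HFiles and decompressing locally.
# All figures come from the question; 1 GB = 1024 MB.
wire_mb_s = 20        # observed network throughput
decompressed_gb = 50  # data on the wire today (region servers decompress)
compressed_gb = 10    # same region as Snappy-compressed HFiles

current_min = decompressed_gb * 1024 / wire_mb_s / 60  # ~42.7 min
copy_min = compressed_gb * 1024 / wire_mb_s / 60       # ~8.5 min
print(round(current_min / copy_min, 1))  # 5.0x less time on the wire
```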

Alternatively, you can apply Column and ColumnFamily filters if you don't need all the data from HBase.

Upvotes: 1
