DPEZ

Reputation: 117

Best option for lookup data in Spark

I have a requirement where I need to read a message from a Kafka topic, do a lookup on a dataset, and then send the message on depending on the result of the lookup. An example below makes this a bit clearer.

The Kafka topic receives an XML message which has a field messageID holding the value 2345.

We do a lookup to check whether a message with this ID has been sent before. If it has not, we send the message on and then add this messageID to the lookup data. If this messageID is already in the lookup data, we do not send it on.
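The flow described above can be sketched roughly as follows. This is only an illustration: the XML layout (a `messageID` element directly under the root), the function names, and the in-memory `set` standing in for the lookup data are all assumptions, not the real component.

```python
import xml.etree.ElementTree as ET


def extract_message_id(xml_text):
    """Pull the messageID out of the message (assumed XML layout)."""
    root = ET.fromstring(xml_text)
    node = root.find("messageID")
    if node is None or node.text is None:
        raise ValueError("message has no messageID field")
    return node.text.strip()


def process(xml_text, seen_ids, send):
    """Forward the message only if its ID has not been seen before.

    seen_ids is a plain set standing in for the lookup data;
    send is any callable that forwards the message downstream.
    Returns True if the message was forwarded.
    """
    message_id = extract_message_id(xml_text)
    if message_id in seen_ids:
        return False  # already sent once, drop it
    send(xml_text)
    seen_ids.add(message_id)
    return True


# Example usage with a list standing in for the downstream topic.
message = "<message><messageID>2345</messageID></message>"
seen = set()
sent = []
first = process(message, seen, sent.append)
second = process(message, seen, sent.append)
```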

Currently this is achieved by using an HBase table to hold the lookup data. However, we can receive many millions of messages per day, and I am concerned that the performance of the component will degrade over time.

Is there a more optimised alternative to using HBase for this lookup data, such as storing the data in memory in an RDD? I attempted this but had some difficulty, as Spark contexts are not serializable, so I couldn't add to the existing lookup dataset.

Any suggestions are much appreciated!

Many thanks

Dan

Upvotes: 1

Views: 1225

Answers (1)

Avishek Bhattacharya

Reputation: 6994

Spark is good at processing large volumes of data for analytics. The RDD abstraction was created to address the performance limitations of MapReduce. Spark is not a replacement for a key/value store like HBase.
From your description, it looks like you need a cache layer on top of HBase. This could be achieved with Redis or another distributed caching mechanism.
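To make the cache-layer idea concrete, here is a minimal sketch in plain Python. `SlowStore` is a hypothetical stand-in for the HBase table and `CachedLookup` for the cache in front of it; neither uses a real HBase or Redis client API, and the LRU eviction policy and capacity are assumptions for illustration. Only IDs already known to be sent are cached, which is safe because an ID never goes back to "not sent".

```python
from collections import OrderedDict


class SlowStore:
    """Hypothetical stand-in for the HBase table; counts how often it is read."""

    def __init__(self):
        self._ids = set()
        self.reads = 0

    def contains(self, message_id):
        self.reads += 1
        return message_id in self._ids

    def add(self, message_id):
        self._ids.add(message_id)


class CachedLookup:
    """Bounded LRU cache of already-seen IDs in front of the slow store."""

    def __init__(self, store, capacity=100_000):
        self._store = store
        self._capacity = capacity
        self._seen = OrderedDict()

    def _remember(self, message_id):
        self._seen[message_id] = True
        self._seen.move_to_end(message_id)
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently used ID

    def should_forward(self, message_id):
        """Return True the first time an ID is seen, False for duplicates."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return False  # answered from the cache, store not touched
        if self._store.contains(message_id):
            self._remember(message_id)
            return False  # duplicate found in the store
        self._store.add(message_id)
        self._remember(message_id)
        return True
```

Repeated duplicates of a recent ID are answered from memory, and the backing store is only read on a cache miss.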
RDD caching won't help here because:

  1. There is no guarantee that the whole dataset is in memory.
  2. A pair RDD supports key-based lookup, but it follows the MapReduce pattern to find the key. An RDD is an abstraction that holds information about where the data lives and its lineage DAG; it does not materialize the data until an action runs on it. Even if all of the data is cached, the RDD still has to search through it for each lookup. This is unlike HBase, where the key is indexed and a lookup can be done in constant time.
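The difference in the second point can be illustrated with a plain-Python analogy (this is not Spark or HBase code, and the names are made up for the example): searching un-indexed key/value pairs means comparing against every element in the worst case, while a hash index answers the same question without a scan.

```python
def scan_lookup(pairs, key):
    """Search un-indexed (key, value) pairs one by one.

    Returns (found, comparisons) so the cost of the scan is visible.
    """
    comparisons = 0
    for k, _ in pairs:
        comparisons += 1
        if k == key:
            return True, comparisons
    return False, comparisons


# 100,000 message IDs held as a flat list of pairs (no index).
pairs = [(str(i), True) for i in range(100_000)]

# The same data behind a hash index, built once up front.
index = dict(pairs)

# Looking up the last ID has to walk the whole list...
found_by_scan, comparisons = scan_lookup(pairs, "99999")

# ...while the indexed lookup goes straight to the key.
found_by_index = "99999" in index
```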

You could probably build Bloom filters or an index over your data and use Spark to do the lookup. However, that would probably be hard.
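For reference, a Bloom filter itself is small. The sketch below is a minimal single-process version in plain Python, assuming string message IDs; the bit-array size, the number of hashes, and the double-hashing scheme over SHA-256 are illustrative choices, and distributing or persisting it alongside Spark is the hard part that is not shown.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self._num_bits for i in range(self._num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely never added; True means possibly added."""
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("2345")
```

If `might_contain` returns False, the ID is definitely new and the message can be sent without reading HBase. If it returns True, the ID may be a duplicate and still needs to be confirmed against the real lookup data.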

Upvotes: 1
