Reputation: 117
I have a requirement in which I need to read a message from a Kafka topic, do a lookup against a dataset, and then forward the message on depending on the result of the lookup. An example below to make this a bit clearer:
The Kafka topic receives an XML message which has a field messageID holding the value 2345.
We do a lookup to confirm whether a message with this ID has been sent before. If it has not, we forward the message on and then add this messageID to the lookup data. If this messageID is already in the lookup data, we do not forward it.
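For illustration, the check-and-forward logic described above can be sketched with a plain in-memory set (in the real system the lookup store is HBase; `send_message` here is a hypothetical stand-in for the downstream producer):

```python
seen_ids = set()  # in production this lookup data lives in HBase, not in memory

def process(message_id, payload, send_message):
    """Forward payload only if message_id has not been seen before."""
    if message_id in seen_ids:
        return False          # duplicate: drop the message
    send_message(payload)     # forward downstream (e.g. to another Kafka topic)
    seen_ids.add(message_id)  # record the ID so later duplicates are dropped
    return True
```

Note that, as in the question, the ID is recorded only after the message has been forwarded.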
Currently this is achieved by using an HBase table to hold the lookup data. However, we can receive many millions of messages per day and I am concerned that the performance of the component will degrade over time.
Is there an alternative, more optimised solution to using HBase for this lookup data, such as storing it in memory in an RDD? I attempted this but had some difficulty, as Spark contexts are obviously not serializable, so I couldn't add to the existing lookup dataset.
Any suggestions are much appreciated!
Many thanks
Dan
Upvotes: 1
Views: 1225
Reputation: 6994
Spark is good for processing large volumes of data for analytic purposes. The RDD abstraction was created to overcome the performance limitations of the MapReduce model. Spark is not a replacement for a key/value store like HBase.
Looking at your problem, it seems you need a cache layer on top of HBase. This could be achieved with Redis or another distributed caching mechanism.
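With Redis, the check and the insert can be collapsed into one atomic operation using SET with the NX flag, which avoids a race between the lookup and the write. A sketch, assuming the redis-py client (the function name and key prefix are my own; in production you would pass `client = redis.Redis(host=..., port=6379)`):

```python
def already_sent(client, message_id, ttl_seconds=7 * 24 * 3600):
    """Atomically check-and-mark a message ID.

    `client` is any object with a redis-py-compatible `set` method.
    Returns True if the ID was already present (duplicate), False if
    this call recorded it for the first time.
    """
    # SET key value NX EX ttl: the write succeeds only if the key does
    # not already exist, so check and insert happen in a single atomic
    # operation. The TTL bounds memory use if old IDs can expire.
    was_set = client.set(f"seen:{message_id}", 1, nx=True, ex=ttl_seconds)
    return not was_set
```

A caveat versus the question's flow: this records the ID before the message is forwarded, so a crash between the two steps could drop a message; whether that trade-off is acceptable depends on your delivery guarantees.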
RDD caching won't help here because RDDs are immutable once created: you cannot efficiently add newly seen message IDs to a cached RDD as messages arrive, and rebuilding or re-broadcasting the lookup data for every message would be prohibitively expensive.
You could probably also build a Bloom filter index over your data and use Spark for the lookups. However, that would likely be hard to get right.
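To make the Bloom filter idea concrete: a Bloom filter answers "definitely not seen" or "possibly seen", so a negative answer lets you skip the HBase lookup entirely, and only possible hits fall through to HBase. A minimal illustrative implementation (in production you would use a library such as Guava's BloomFilter on the JVM rather than rolling your own, and the sizing parameters here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: no false negatives, tunable false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes independent bit positions by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely never added; True means possibly added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Because the filter can return false positives but never false negatives, it works as a cheap pre-check in front of HBase: the authoritative store is still needed for IDs the filter flags as possibly seen.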
Upvotes: 1