Join using Hadoop Map Reduce to join data from NoSQL databases

Question

I am currently using Solr as a NoSQL database. I have indexed various types of documents that sometimes have relationships between them.

For new use cases I have to perform the equivalent of a join, which Solr does not support.

I was wondering if there is a way to submit a map-reduce job to Hadoop, where Hadoop can then pull in the data from Solr and perform the join.

I am looking for:

a discussion
existing open source project which does this
example code
or a critique telling me this cannot be done either easily or in the general case.

Thanks in advance.

Note: I saw some questions here on related or similar topics: here, here and here but I didn't get what I was looking for.

jayunit100 · Accepted Answer

You have a few basic options.

1) Use the Solr REST API to join records manually by issuing many requests in parallel.

This strategy requires that you feed your mappers Solr record IDs or query terms, and then run all your mappers against a Solr cluster. If you send out synchronous requests with timeouts and have a reasonably performant Solr cluster, the records can then be written to your reducer as necessary.
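
As a rough illustration of option 1, here is a minimal sketch in plain Python rather than real Hadoop code. It assumes a Solr core reachable at a made-up URL (SOLR_SELECT_URL), input records that carry a hypothetical author_id field, and related documents that can be found by querying the id field; none of these names come from the question. The only Solr behaviour it relies on is the standard /select handler with q and wt=json, whose JSON response lists matches under response.docs.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: host, port and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(base_url, field, value):
    """Build a Solr /select URL asking for documents where field == value."""
    # Note: values containing double quotes would need escaping first.
    params = {"q": f'{field}:"{value}"', "wt": "json"}
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_docs(url, timeout=5.0):
    """Issue one synchronous request with a timeout and return the matches."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]["docs"]


def map_record(record, fetch=fetch_docs):
    """Mapper body: look up the related documents for one input record.

    Emits (join_key, (record, related_doc)) pairs for the reducer.
    """
    url = build_query_url(SOLR_SELECT_URL, "id", record["author_id"])
    try:
        matches = fetch(url)
    except OSError:
        # Covers URLError and socket timeouts; here the record is skipped.
        matches = []
    for match in matches:
        yield record["author_id"], (record, match)
```

The fetch parameter exists only so the lookup can be swapped out in tests; in a real job every mapper would hit the cluster, which is why the timeout and the cluster's capacity matter.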

2) Read the Solr core indices directly in your mappers, and do a reduce-side join.

This might be slightly more difficult. Because each core is indexed and written into a hierarchical folder structure, you will need some logic in your mapper's setup() method to read the metadata of a given core. You might also have to copy all your cores into HDFS. But once you have parsed the Solr inputs using the existing Solr Java index reader API, it will be easy enough to emit them to your reducer for a standard reduce-side join.
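
The reduce-side join itself can be sketched independently of how the records are read. The following plain Python simulation is not Hadoop code and does not touch Solr index files; it only shows the pattern: each mapper tags its records with the source they came from and emits them under the join key, the shuffle groups values by key, and the reducer pairs up records from the two sources. The source names ("authors", "books") and field names are made up for the example.

```python
from collections import defaultdict
from itertools import product


def map_tagged(source, records, key_field):
    """Map phase: emit (join_key, (source_tag, record)) for each record."""
    for rec in records:
        yield rec[key_field], (source, rec)


def shuffle(pairs):
    """Stand-in for Hadoop's shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_values, left_tag, right_tag):
    """Reduce phase: inner join of the two sources for a single key."""
    left = [rec for tag, rec in tagged_values if tag == left_tag]
    right = [rec for tag, rec in tagged_values if tag == right_tag]
    for l_rec, r_rec in product(left, right):
        yield key, l_rec, r_rec


if __name__ == "__main__":
    authors = [{"id": "a1", "name": "Ann"}]
    books = [{"id": "b1", "author_id": "a1"}, {"id": "b2", "author_id": "a1"}]
    pairs = list(map_tagged("authors", authors, "id"))
    pairs += list(map_tagged("books", books, "author_id"))
    for key, values in shuffle(pairs).items():
        for row in reduce_join(key, values, "authors", "books"):
            print(row)
```

Keys that appear in only one source produce no output here, which makes this an inner join; an outer join would emit the unmatched side with a placeholder instead.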

3) If one small data set (< 1 GB) is being joined to a large one, you can simply read the small one in by issuing REST queries and cache it in memory as a big, ugly, statically available object, or store its data in the distributed cache as a file. You might even be able to issue the queries in the setup() part of your mapper and cache the results locally per instance.
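
This is commonly called a map-side (or replicated) join. A minimal sketch in plain Python, assuming the small side fits in memory and is keyed by a hypothetical id field, while the large side refers to it through a hypothetical author_id field. The load_small_side callable stands in for whatever actually fetches the small data set (REST queries or a distributed cache file); it is an illustration, not a Hadoop API.

```python
class MapSideJoinMapper:
    """Mapper that joins each large-side record against an in-memory table."""

    def __init__(self, load_small_side):
        # load_small_side: zero-argument callable returning an iterable of
        # dicts. In a real job this would issue the REST queries or read
        # the distributed cache file.
        self._load_small_side = load_small_side
        self.lookup = {}

    def setup(self):
        """Run once per mapper instance: cache the small data set by key."""
        for rec in self._load_small_side():
            self.lookup[rec["id"]] = rec

    def map(self, record):
        """Run once per large-side record: emit the joined pair, if any."""
        match = self.lookup.get(record["author_id"])
        if match is not None:
            yield record["author_id"], (match, record)


if __name__ == "__main__":
    mapper = MapSideJoinMapper(lambda: [{"id": "a1", "name": "Ann"}])
    mapper.setup()
    for book in [{"id": "b1", "author_id": "a1"}, {"id": "b2", "author_id": "zz"}]:
        for row in mapper.map(book):
            print(row)
```

Because the join completes inside the mapper, no reducer is needed for it; the cost is that every mapper instance holds its own copy of the small data set.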

In any case: joining data in Solr is not particularly easy. Any solution you go for will have drawbacks. The proper solution is to redo your Solr indices so that they are sufficiently denormalized, and do your joins using a tool like standard map/reduce, Hive or Pig.
