okmich

Reputation: 740

Analytical Queries with MongoDB

I am new to MongoDB and am having difficulty implementing a solution in it. Consider a case where I have two collections, client and sales, designed as follows:

Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion

Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of the products sold, each in the form {product_code, description, units, unit price, amount}
total sales

Now there is a requirement to develop another collection for analytical queries where the following questions can be answered

  1. What is the distribution of sales by gender, region and emp_status?
  2. What are the most purchased products for clients in a particular region?

I considered implementing a heavily denormalized collection, a flat and wide collection of the properties of the sales and client collections, so that I can use map-reduce to answer these questions. In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make Map-Reduce or Aggregation help here.

Questions: How do I implement Map-Reduce to map across 2 collections? Is it possible to chain MapReduce operations?

Regards.

Upvotes: 1

Views: 658

Answers (1)

Philipp

Reputation: 69683

MongoDB does not do JOINs - period!

MapReduce always runs on a single collection. You cannot have a single MapReduce job which selects from more than one collection. The same applies to aggregation.

When you want to do some data-mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script which iterates over all clients and, for each one:

  1. finds all Sales documents for the client
  2. merges the relevant fields from Client into each document
  3. inserts the resulting document into the new collection
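
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the collections are stood in for by in-memory lists of dicts, `client_id` is treated as a plain id rather than a DBRef, and the function name `denormalize_sales`, the `CLIENT_FIELDS` selection and the sample data are all made up for the example.

```python
from collections import defaultdict

# Client fields copied into each sale; this selection is an assumption.
CLIENT_FIELDS = ("gender", "region", "emp_status", "occupation", "religion")


def denormalize_sales(clients, sales):
    """Return copies of the sales documents with client fields embedded."""
    # Index the sales by client id so each client's sales are found quickly.
    sales_by_client = defaultdict(list)
    for sale in sales:
        sales_by_client[sale["client_id"]].append(sale)

    flat = []
    for client in clients:
        # Step 2: pick the relevant fields from the Client document.
        snapshot = {field: client.get(field) for field in CLIENT_FIELDS}
        # Step 1: find all Sales documents for this client.
        for sale in sales_by_client.get(client["id"], []):
            # Step 3: build the document destined for the new collection.
            flat.append({**sale, "client": dict(snapshot)})
    return flat


# Hypothetical sample data, shaped like the collections in the question.
clients = [
    {"id": 1, "full_name": "A. Example", "gender": "F", "region": "North",
     "emp_status": "employed", "occupation": "engineer", "religion": None},
    {"id": 2, "full_name": "B. Example", "gender": "M", "region": "South",
     "emp_status": "student", "occupation": "none", "religion": None},
]
sales = [
    {"id": 10, "client_id": 1, "total_sales": 50.0},
    {"id": 11, "client_id": 2, "total_sales": 30.0},
    {"id": 12, "client_id": 1, "total_sales": 40.0},
]

flat_sales = denormalize_sales(clients, sales)
```

Against a real database, the lists would presumably be replaced by cursors from a driver such as pymongo (`find()` on the two collections) and the result written to the new collection with `insert_many()`; the merging logic itself would stay the same.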

When your Client document is small and doesn't change often, you might consider always embedding it into each Sales document. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas uncritically. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB.

Besides, sometimes you might want redundancy to preserve historical data. When you want to know your historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you can solve this with separate Address documents which have date ranges, but that would make it even more complicated.
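
Once the client fields are embedded in each sale, both questions from the original post become group-bys over a single collection. The sketch below shows that logic in plain Python on a few made-up documents; the field names (`client`, `total_sales`, `product_code`, `units`) and the helper names are assumptions for illustration, not a fixed schema.

```python
from collections import Counter, defaultdict

# Hypothetical denormalized sales: client fields embedded under "client".
flat_sales = [
    {"client": {"gender": "F", "region": "North", "emp_status": "employed"},
     "products": [{"product_code": "P1", "units": 2},
                  {"product_code": "P2", "units": 1}],
     "total_sales": 50.0},
    {"client": {"gender": "M", "region": "North", "emp_status": "student"},
     "products": [{"product_code": "P1", "units": 3}],
     "total_sales": 30.0},
    {"client": {"gender": "F", "region": "North", "emp_status": "employed"},
     "products": [{"product_code": "P2", "units": 6}],
     "total_sales": 40.0},
    {"client": {"gender": "F", "region": "South", "emp_status": "employed"},
     "products": [{"product_code": "P3", "units": 5}],
     "total_sales": 25.0},
]


def sales_distribution(docs):
    """Sum total_sales per (gender, region, emp_status) combination."""
    totals = defaultdict(float)
    for doc in docs:
        c = doc["client"]
        totals[(c["gender"], c["region"], c["emp_status"])] += doc["total_sales"]
    return dict(totals)


def top_products(docs, region, n=3):
    """Return up to n (product_code, units) pairs for one region, most units first."""
    units = Counter()
    for doc in docs:
        if doc["client"]["region"] == region:
            for item in doc["products"]:
                units[item["product_code"]] += item["units"]
    return units.most_common(n)


distribution = sales_distribution(flat_sales)
north_top = top_products(flat_sales, "North")
```

On a real collection, the first function corresponds roughly to an aggregation `$group` stage keyed on the three embedded fields with a `$sum` over the sale total, and the second to a `$match` on region followed by `$unwind` on the products array and another `$group`.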

Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so when your clients tend to return often, this might result in sub-par write performance.

Upvotes: 2
