Reputation: 4058
I have a use case in which I have 3M records in my MongoDB database.
I want to aggregate the data based on some condition.
I found two ways to accomplish it: map-reduce using Apache Spark, and MongoDB's native map-reduce.
I successfully executed my use case using both of the above methods and found their performance to be similar.
My question is:
Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or native MongoDB map-reduce) is more efficient?
Upvotes: 3
Views: 1234
Reputation: 18835
Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or native MongoDB map-reduce) is more efficient?
In the broad sense of the map-reduce algorithm, yes, although implementation-wise they are different (i.e. JavaScript vs Java JAR).
If your question is more about finding out the suitability of the two for your use case, you should consider other aspects as well, especially since you've found the two to perform similarly for your use case. Let's explore below:
Assuming that you have the resources (time, money, servers) and the expertise to maintain an Apache Spark cluster alongside a MongoDB cluster, then having a separate processing framework (Spark) and data storage (MongoDB) is ideal: the MongoDB servers keep their CPU/RAM for database querying, while the Spark nodes keep theirs for intensive ETL. Afterwards, write the result of the processing back into MongoDB.
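For illustration, here is a minimal PySpark sketch of that architecture. It assumes the MongoDB Spark Connector 10.x; the URIs, database, collection, and field names are all placeholders, not anything from your setup:

```python
from pyspark.sql import SparkSession

# Placeholders throughout: adjust URIs, database/collection and field names.
spark = (SparkSession.builder
         .appName("mongo-etl")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://mongo-host:27017/mydb.events")
         .config("spark.mongodb.write.connection.uri",
                 "mongodb://mongo-host:27017/mydb.eventTotals")
         .getOrCreate())

# The heavy processing happens on the Spark nodes, not the MongoDB servers.
df = spark.read.format("mongodb").load()
totals = df.groupBy("category").count()  # "category" is a made-up field

# Write the result of the processing back into MongoDB.
totals.write.format("mongodb").mode("overwrite").save()
```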
If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes so that only the range of data Spark needs is extracted, as opposed to pulling unnecessary data over to the Spark nodes, which means more processing overhead, higher hardware requirements, and more network latency.
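Continuing the sketch above (same SparkSession, connector 10.x assumed), the `aggregation.pipeline` read option pushes a `$match` down to MongoDB, so only matching documents, narrowed by any suitable index, are shipped to the Spark executors; the `status` field is again invented:

```python
# Push the filter down to MongoDB instead of filtering inside Spark.
pipeline = '[{"$match": {"status": "active"}}]'

df = (spark.read.format("mongodb")
      .option("aggregation.pipeline", pipeline)
      .load())
df.groupBy("category").count().show()
```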
If you don't have the resources and expertise to maintain a Spark cluster, then keep it in MongoDB. It is worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce. If you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
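As a sketch of what such a conversion can look like (using PyMongo, with invented connection details and field names), a typical map-reduce that emits a key and sums values collapses into a single `$group` stage:

```python
from pymongo import MongoClient

# Placeholder connection and names, for illustration only.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["events"]

# Roughly equivalent to a map function emitting (category, amount)
# and a reduce function summing the amounts.
results = coll.aggregate([
    {"$match": {"status": "active"}},  # filter early; can use an index
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
])
for doc in results:
    print(doc)
```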
If your use case doesn't require real-time processing, you can configure a delayed or hidden member of a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.
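Below is a rough PyMongo sketch of that setup, assuming a 3-member replica set; the host names and member index are placeholders. It hides one member (a delayed member would additionally set `secondaryDelaySecs` on MongoDB 5.0+) and then connects to it directly for processing:

```python
from pymongo import MongoClient

# Hide the third member so regular clients never read from it;
# it keeps replicating data and can be dedicated to aggregation jobs.
client = MongoClient("mongodb://primary-host:27017")
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
cfg["members"][2]["priority"] = 0   # hidden members must have priority 0
cfg["members"][2]["hidden"] = True
client.admin.command({"replSetReconfig": cfg})

# Processing jobs connect straight to the hidden member.
analytics = MongoClient("mongodb://hidden-host:27017",
                        directConnection=True,
                        readPreference="secondaryPreferred")
totals = analytics["mydb"]["events"].aggregate([
    {"$group": {"_id": "$category", "total": {"$sum": 1}}}
])
```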
Upvotes: 4