uwegeercken

Reputation: 21

MongoDB mapReduce missing documents

I have a strange situation with map-reduce: the result does not account for all documents, although it should.

I have a collection of tweets like the one shown below. The collection holds 230 documents, and my query is on createdyear. Here is a sample:

{
    "_id" : ObjectId("56e55b52330dfb156547d559"),
    "message" : "RT @TwitFAKTA: Kiper MU, David De Gea mempunyai ritual unik sebelum bertanding, yaitu memutar lagu-lagu Metallica dengan keras.",
    "createdyear" : "2016",
    "handle" : "xxx",
    "createdtime" : "13:23:33",
    "searchtopic" : "Metallica",
    "createdmonth" : "03",
    "createddate" : "2016-03-13",
    "user" : "xxx"
}

My map function is very simple. The final result should be the count of tweets per topic and month.

var map = function() {
    emit({topic: this.searchtopic, month: this.createdmonth}, 1);
};

And here is the reduce function. I am simply counting the number of values for the given key.

function(key,value) {
    var counter=0; 
    for (var i=0;i<value.length;i++) { 
        counter = counter +1; 
    }
    return counter; 
};

Then I run mapReduce and store the output in a collection:

db.tweets.mapReduce(map,reduce,{out: "mapreduce_test"})

The result is this:

{
    "result" : "mapreduce_test",
    "timeMillis" : 6,
    "counts" : {
        "input" : 230,
        "emit" : 230,
        "reduce" : 4,
        "output" : 2
    },
    "ok" : 1
}

The map-reduce runs, but the results are not correct. When I list the output collection, I get the following:

{ "_id" : { "topic" : "3 Doors Down", "month" : "03" }, "value" : 2 }
{ "_id" : { "topic" : "Metallica", "month" : "03" }, "value" : 31 }

When I manually search for the documents, I get 228 for Metallica and 2 for 3 Doors Down. These are the 230 input and emitted records.

So where are the other documents? What happened?

Normally I have a process that gets the tweets from Twitter and stores them in MongoDB, so the collection keeps growing. When I run the mapReduce task regularly via cron, I noticed that it works for a while and then suddenly returns wrong results. Have a look:

Sun Mar 13 14:30:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 47.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 14:40:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 67.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 14:50:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 87.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 15:00:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 7.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

The count keeps growing and then suddenly drops at 15:00, although the documents are still in the database. I checked multiple times.

I have also run this on a second machine, but with the same results.

Does anybody have an explanation for this behavior?

Thanks,

Uwe

Upvotes: 0

Views: 284

Answers (1)

Joachim Isaksson

Reputation: 181077

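Since MongoDB batches reduce, it can call your reduce function more than once for the same key and feed the result of an earlier call back in as one of the values. So you can't just add 1 per value in your reduce; you need to sum up value[i]: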

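var reduce = function(key, value) {
    var counter = 0;
    for (var i = 0; i < value.length; i++) {
        // value[i] is either an emitted 1 or a partial count from an earlier reduce call
        counter = counter + value[i];
    }
    return counter;
};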

Let's say the batch size is 100. The reduce function gets passed 100 values in the first batch (summing up to 100), and when it runs on the next batch it gets passed 101 values (one holding the sum so far, plus 100 new values).

When you add 1 instead of value[i], the total from all previous batches is always counted as just 1.
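To see the effect without a database, here is a rough sketch in plain JavaScript. simulateBatchedReduce is a hypothetical helper, not a MongoDB API, and the batch size of 100 is only the assumption from the example above; the server's real batching is internal and will differ, so the wrong number here won't match the 31 from the question.

```javascript
// Reduce from the question: adds 1 per element, whatever the element holds.
function countingReduce(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + 1;
    }
    return counter;
}

// Corrected reduce: adds the element itself, so partial counts carry over.
function summingReduce(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + values[i];
    }
    return counter;
}

// Hypothetical stand-in for the server's batching (an assumption, not MongoDB code):
// reduce batchSize emitted values at a time and pass the previous partial
// result back in as the first value of the next call.
function simulateBatchedReduce(reduceFn, key, emitted, batchSize) {
    var partial = null;
    for (var start = 0; start < emitted.length; start += batchSize) {
        var batch = emitted.slice(start, start + batchSize);
        if (partial !== null) {
            batch.unshift(partial);
        }
        partial = reduceFn(key, batch);
    }
    return partial;
}

var key = { topic: "Metallica", month: "03" };
var emitted = new Array(228).fill(1); // one emitted 1 per Metallica tweet

var buggy = simulateBatchedReduce(countingReduce, key, emitted, 100);
var fixed = simulateBatchedReduce(summingReduce, key, emitted, 100);
console.log(buggy, fixed); // prints: 29 228
```

With 228 values and batches of 100, the counting version sees 100, then 101, then 29 elements, and only the last count survives. The summing version gives 228 no matter how the values are split up.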

Upvotes: 1
