Reputation: 115
I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles, and its documents have the following form:
{
    _id: "Tree",
    id: "18955875",
    linksFrom: [
        {
            name: "Forest",
            count: 6
        },
        [...]
    ],
    categories: [
        "Trees",
        "Forest_ecology",
        [...]
    ]
}
The linksFrom field stores all the articles this article points to, along with how many times each link occurs. Next, I want to create another field, linksTo, containing all the articles that point to this article. Initially, I went through the whole collection and updated every article, but since there are a lot of them it took too much time. I switched to aggregation for performance and tried it on a smaller set; it works like a charm and is super fast compared with the older method. The aggregation pipeline is as follows:
db.runCommand(
    {
        aggregate: "articles",
        pipeline: [
            {
                $unwind: "$linksFrom"
            },
            {
                $sort: { "linksFrom.count": -1 }
            },
            {
                $project: {
                    name: "$_id",
                    linksFrom: "$linksFrom"
                }
            },
            {
                $group: {
                    _id: "$linksFrom.name",
                    linksTo: { $push: { name: "$name", count: { $sum: "$linksFrom.count" } } }
                }
            },
            {
                $out: "TEMPORARY"
            }
        ],
        allowDiskUse: true
    }
)
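For reference, the pipeline is meant to produce inverted documents of this shape in TEMPORARY (my illustration based on the sample article above, not actual output):
{
    "_id" : "Forest",
    "linksTo" : [
        { "name" : "Tree", "count" : 6 }
        // ... one entry per article linking here
    ]
}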
However, on a large dataset (the English Wikipedia) I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that too many articles link to the United_States article, so the corresponding document grows beyond 16MB, currently to almost 24MB. Unfortunately, I cannot even check whether that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationships between articles are stored with IDs rather than long names, but I'm afraid that might not be enough, especially because my plan is to merge the two collections for every article later...
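One way to at least verify that suspicion without building the oversized array is to count the inbound links instead of pushing them (a sketch against the schema above; grouping with $sum keeps the result document tiny):
db.articles.aggregate(
    [
        { $unwind: "$linksFrom" },
        { $match: { "linksFrom.name": "United_States" } },
        // count the linking articles instead of collecting them
        { $group: { _id: "$linksFrom.name", inboundLinks: { $sum: 1 } } }
    ],
    { allowDiskUse: true }
)
Multiplying inboundLinks by the average entry size (article name plus a few bytes of BSON overhead) gives a rough estimate of how big the grouped document would get.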
The question is: does anyone have a better idea? I don't want to increase the limit; rather, I'm thinking about a different approach to storing this data in the database.
UPDATE after comment by Markus
Markus is correct: I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles, I have three more collections: one with links and two others, labels and stemmed-labels. The first one stores all the links that occur in the dump in the following way:
{
    _id: "tree",
    stemmedName: "tree",
    targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName is the stemmed version of _id, and targetArticle marks which article this text pointed to. I'm in the middle of adding sourceArticle to this collection, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
    _id: "tree",
    targetArticles: [
        {
            name: "Christmas_tree",
            count: 1
        },
        {
            name: "Tree",
            count: 166
        },
        [...]
    ]
}
The third collection, stemmed-labels, is analogous to labels, with its _id being a stemmed version of the root label.
So far, the first collection, links, serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and can immediately get all target articles with one query. Then I use the articles and labels collections in further processing.
This is where the main question comes in. I thought it was better to store all possible articles for a given phrase in one document rather than leave them scattered across the links collection. Only now did it occur to me that, as long as the lookups are indexed, the overall performance might be the same for one big document as for many smaller ones! Is this a correct assumption?
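One way to test that assumption empirically is to compare the two access patterns with explain (a sketch; sourceArticle is the field mentioned above as being added):
// index the per-link lookup field (hypothetical name from the update above)
db.links.createIndex({ sourceArticle: 1 })
// many small documents: indexed multi-document query
db.links.find({ sourceArticle: "Tree" }).explain("executionStats")
// one big document: single _id lookup for comparison
db.labels.find({ _id: "tree" }).explain("executionStats")
Comparing totalDocsExamined and executionTimeMillis in the two outputs should show whether the scattered documents actually cost more.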
Upvotes: 4
Views: 1768
Reputation: 20703
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the Wikipedia example) are linked more often than you could store in a single document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongo shell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
Let's begin with the questions you want to have answered:
1. Which articles link to a given article?
2. Which articles does a given article link to?
3. How many articles link to a given article?
4. How many articles does a given article link to?
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
    "_id": new ObjectId(),
    // optional, for recrawling or pointing out a given state
    "date": new ISODate(),
    "article": wikiUrl,
    "linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason is that we do not want to record the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTo could occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
As you might already guess, answering the questions becomes pretty straightforward now. I have used the following statements to create a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
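A side note not covered above: one round trip per link can be slow at Wikipedia scale. On MongoDB 3.2+ (an assumption about your version), the same upserts can be batched with bulkWrite:
db.links.bulkWrite(
    [
        { updateOne: {
            filter: { "article": "HMS_Warrior_(1860)", "linksTo": "Royal_Navy" },
            update: { "$set": { "date": new ISODate() } },
            upsert: true
        }},
        { updateOne: {
            filter: { "article": "Royal_Navy", "linksTo": "Mutiny_on_the_Bounty" },
            update: { "$set": { "date": new ISODate() } },
            upsert: true
        }}
    ],
    // unordered lets the server process the batch without stopping on one failure
    { ordered: false }
)
The equality fields from each filter are copied into a newly upserted document, so the result matches the single update statements above.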
Question 1: Which articles link to a given article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks = [];
var cursor = db.links.find({ "linksTo": "Royal_Navy" }, { "_id": 0, "article": 1 });
cursor.forEach(
    function(doc) {
        toLinks.push(doc.article);
    }
);
printjson(toLinks);
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
Question 2: Which articles does a given article link to?
This works pretty much like the first question; we basically only change the query:
var fromLinks = [];
var cursor = db.links.find({ "article": "Royal_Navy" }, { "_id": 0, "linksTo": 1 });
cursor.forEach(
    function(doc) {
        fromLinks.push(doc.linksTo);
    }
);
printjson(fromLinks);
// Output: [ "Mutiny_on_the_Bounty" ]
Question 3: How many articles link to a given article?
It should be obvious that if you have already answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this.
Using .count()
You can use this method on replica sets. On sharded clusters, it doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
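On MongoDB 4.0 or newer (an assumption; the original answer predates this API), countDocuments returns an accurate count even on sharded clusters:
db.links.countDocuments({ "linksTo": "Royal_Navy" })
// Output: 2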
Using an aggregation
This works in any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
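On MongoDB 3.4+ (my addition), the $group stage can be replaced by the dedicated $count stage:
db.links.aggregate([
    { "$match":{ "linksTo":"Royal_Navy" }},
    { "$count":"isLinkedFrom" }
])
// Output: { "isLinkedFrom" : 2 }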
Question 4: How many articles does a given article link to?
Again, you can answer this question by reading the length of the array from question 2 or by using the .count() method. The aggregation again is simple:
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
As for the indices, I haven't really checked them, but individual indices on the fields are probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do not always ask for the first field. So this is probably as optimized as it can get.
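To verify the indices are actually used (a quick check, not from the original answer), inspect the query plan:
db.links.find({ "linksTo": "Royal_Navy" }).explain("queryPlanner")
// winningPlan should show an IXSCAN over { "linksTo": 1 } rather than a COLLSCAN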
We are using an extremely simple, scalable model and rather simple queries and aggregations to answer the questions you have about the data.
Upvotes: 4