In mongodb storing tags, should I use joint string or array?

Question

In most tutorial the tags will be stored like [tag, tag, tag]. I have another thought that saving the tag like this: "tag.tag.tag" , e.g. "web.javascript.angularJS", and then query the document like this

db.articles.find({'tags': /javascript/})

I suppose that finding sub string is faster than element in an array. Does anyone has similar experience like this.

Markus W Mahlberg · Accepted Answer

Introduction

Data modeling in MongoDB is done by first identifying the questions you need to get answered and derive your optimized data model from those questions. In your case, your question seems to be

For a given tag, what are the articles?

In order to have your queries answered as fast as possible, you need to put an index on them. An index basically is a key-value store of keys as the user defines them and document locations in the data files.

Look into the indices

We will have a look at how an index will look like if you wrote the tags in a single string. Assume we have three documents, each with three tags, two of them with the normalized tag of "javascript". Simplified a great deal (actually, indexes are stored in B-trees), our index will look like this:

"foo.bar.baz": LocationOfDocument1;
"foo.javascript.bar": LocationOfDocument2;
"bar.javascript.baz": LocationOfDocument3;

As you can see, we have a lot of redundancy on the key side here. This has two problems. The first problem is that even when a tag is found, the index still might offer additional hits, so our query takes more than the optimal time. The second problem is that the redundancy eats up precious RAM. Image you had hundreds of thousands or even millions of articles.

So, how would our index look like if we used an array to store tags?

"foo":[ LocationOfDocument1, LocationOfDocument2 ];
"bar":[ LocationOfDocument2, LocationOfDocument2, LocationOfDocument3 ];
"baz":[ LocationOfDocument1, LocationOfDocument3 ];
"javascript":[ LocationOfDocument2, LocationOfDocument3 ];

Still redundant, right? Well, except for some factors: we have reduced the size of the key side a great deal and the "LocationOfDocumentX" values are rather cheap to store when compared to rather expensive lengthy strings as keys in a B-tree. (Excursus: I think the document locations are stored as 4 byte integers.) So our index might have more entries, but it is by far more compact.

Plus, we have an additional advantage: We can ditch the rather costly regex. Put different: You can eliminate the cost of using a regex against the keys of the index to find your search string by using a simple equality expression. In shell terms, this would look like:

db.articles.insert({"foo":bar,tags:[tag1.toLowerCase(), tag2.toLowerCase()]})
…
db.articles.find({"tags":inputStr.toLowerCase()})

With the indices stored in a B-Tree, your search time is reduced a great deal. There is yet another advantage. Since B-Trees are sorted, when we found a positive match, e.g. on "javascript", we will have all documents with those tags and the index processing can stop. With a regex search on the keys, all keys of the index would have to be processed each and every time – and with a rather costly operation, too.

Conclusion

With the tags stored in an array, you will speed up the average lookup time for a given tag and are surely not worse than an index on tags reduced to rather long strings. Furthermore, you need less RAM for storing the index, which is rather important when it comes to scaling.

Anticipating according comments: Yes, that is what the data and experience shows, too.

Note I hesitate from making the following suggestion (since this might well do more harm than good), but there are use cases in which a text search index might make sense. For example, when you want to do a case insensitive search for "JavaScript" both on the tags, titles and text of the articles. Using a text index comes with some intricacies out of scope of this answer, however. And still, you would have your tags in an array.

In mongodb storing tags, should I use joint string or array?

Answers (1)

Introduction

Look into the indices

Conclusion

Related Questions