Reputation: 9685
Since MongoDB 2.6.x made the 1024-byte limit on index keys a hard error, I've had to remove a very useful compound index that included a text field; the field was sometimes quite long and contained high Unicode characters, so it exceeded the byte limit.
I've had to replace it with a hashed index on that single field, which forces MongoDB to open the BSON and inspect the other fields outside of the hashed index.
I'd like to try to remove these extra-long documents (so I can restore the original compound index), but I don't know how to query for documents where that field's data exceeds a certain number of bytes. Does anyone know a way?
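One way to do this without storing anything extra is a `$where` clause that computes the UTF-8 byte length in JavaScript per document (slow — it cannot use an index and runs a full scan, but it works for a one-off cleanup). This is a sketch; the helper name `utf8ByteLength` and the 800-byte threshold are my own choices:

```javascript
// Hypothetical helper: UTF-8 byte length of a string. encodeURIComponent
// emits one %XX escape per UTF-8 byte of a non-ASCII character, and leaves
// only single-byte ASCII characters unescaped, so collapsing each escape to
// one character makes the string length equal the UTF-8 byte count.
function utf8ByteLength(s) {
  return encodeURIComponent(s).replace(/%[A-F\d]{2}/g, 'U').length;
}

// In the mongo shell, this could drive a one-off query (full collection scan):
// db.Example.find({
//   $where: function () {
//     return encodeURIComponent(this.text)
//              .replace(/%[A-F\d]{2}/g, 'U').length > 800;
//   }
// }).limit(20);
```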
Upvotes: 1
Views: 1174
Reputation: 9685
So far I've gone for this option...
I've created a new field in my data (which is unfortunate, since it requires significant I/O). This script goes through and sets the value for each document.
db.Example.find({ lb: { $exists: false } }).limit(200000).forEach(function (obj) {
    // encodeURIComponent emits one %XX escape per UTF-8 byte of a non-ASCII
    // character; collapsing each escape to a single character makes the
    // string length equal the UTF-8 byte count.
    var lengthBytes = encodeURIComponent(obj.text).replace(/%[A-F\d]{2}/g, 'U').length;
    // print("id=" + obj._id + ";lenBytes=" + lengthBytes);
    db.Example.update({ _id: obj._id }, { $set: { lb: NumberInt(lengthBytes) } });
});
I've done some spot checks and the values match those from http://mothereff.in/byte-counter
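The same spot check can be run locally in Node.js, comparing the `encodeURIComponent` trick against `Buffer.byteLength`, which computes the UTF-8 byte count directly (the sample strings here are my own):

```javascript
// Sanity check: the regex trick should agree with Buffer.byteLength
// for any well-formed string.
function regexByteLength(s) {
  return encodeURIComponent(s).replace(/%[A-F\d]{2}/g, 'U').length;
}

const samples = ["plain ascii text", "héllo wörld", "€100", "日本語テキスト"];
for (const s of samples) {
  console.log(s, regexByteLength(s), Buffer.byteLength(s, "utf8"));
}
```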
I can then query long strings with:
db.Example.find({lb: {$gt: 800}}).limit(20);
Note: NumberInt forces Mongo to store the length as a 32-bit integer; otherwise it would be stored as a floating-point double.
Upvotes: 2