Camille R

Reputation: 1433

Does GridFS use filename as an index?

I'm currently working on a 'simple' photo system with MongoDB, using a Replica Set and GridFS.

The principle is simple: I store a lot of photos using GridFS, the client knows the filename, and from the filename I can retrieve the file.

Does GridFS index the filename? I hope so, but I couldn't find it stated in any official documentation.

My stats are:

    {
        "ns" : "photos.socialphotos.files",
        "count" : 758086,
        "size" : 168295128,
        "avgObjSize" : 222.00004748801587,
        "storageSize" : 220647424,
        "numExtents" : 15,
        "nindexes" : 2,
        "lastExtentSize" : 43311104,
        "paddingFactor" : 1,
        "flags" : 1,
        "totalIndexSize" : 125084624,
        "indexSizes" : {
            "_id_" : 22925504,
            "filename_1_uploadDate_1" : 102159120
        },
        "ok" : 1
    }

EDIT: by running reIndex() on the collections I reclaimed 30 GB, but it's still way too high.

My indexes are:

    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "photos.socialphotos.files",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "filename" : 1,
            "uploadDate" : 1
        },
        "ns" : "photos.socialphotos.files",
        "name" : "filename_1_uploadDate_1"
    }

Keys per index:

    "keysPerIndex" : {
        "photos.socialphotos.files.$_id_" : 758086,
        "photos.socialphotos.files.$filename_1_uploadDate_1" : 758086
    }

I never use the _id_ index, as I don't store the _id on the client side. Is it OK to remove it? The total index size is 125084624 bytes, which would mean I have almost all my photos in RAM, which seems a bit strange.

Additional questions:

  1. Statistics: mongostat covers the basics. Is there another good tool for monitoring, or do I have to create my own?

  2. Faults: I see a LOT of page faults (around 100 per second) when I'm doing lots of inserts, yet nothing shows up on the console. Where should I investigate?

  3. Connection pool with Java/Tomcat: I'm using a simple Tomcat webapp that connects to MongoDB. Would you recommend opening a new connection to MongoDB for each request (I guess not), keeping a reference to the Mongo object as a singleton (with a Holder, for example), or using a good pool? I couldn't find a standard one.

Thank you very much !

Upvotes: 5

Views: 4391

Answers (2)

William Z

Reputation: 11129

To address your questions:

1) When you initialize a GridFS collection using the Java driver, that driver will automatically create indexes on the .files and the .chunks collections.

2) MongoDB requires that you have an '_id' field and a unique '_id' index. The default '_id' is only 12 bytes long -- there's really no significant overhead from having it present.

Reference: http://www.mongodb.org/display/DOCS/Object+IDs

3) The stats on the "filename_1_uploadDate_1" index only indicate the size of the index. This index contains only the contents of the filename and uploadDate fields; it does not contain any of the photo data itself. You want the active portion of the index to fit in RAM for performance reasons.
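
As a rough sanity check on point 3, the arithmetic on the posted stats can be sketched as follows. The class and method names are made up for illustration; the numbers are copied from the question's `indexSizes` and `keysPerIndex` output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (hypothetical names); figures come from the posted collection stats.
final class IndexSizeCheck {

    // Sum of the per-index sizes; this should match "totalIndexSize".
    static long totalIndexSize(Map<String, Long> indexSizes) {
        long total = 0;
        for (long size : indexSizes.values()) {
            total += size;
        }
        return total;
    }

    // Average number of bytes an index spends per indexed document.
    static double bytesPerEntry(long indexSizeBytes, long keyCount) {
        return (double) indexSizeBytes / keyCount;
    }

    public static void main(String[] args) {
        Map<String, Long> indexSizes = new LinkedHashMap<>();
        indexSizes.put("_id_", 22925504L);
        indexSizes.put("filename_1_uploadDate_1", 102159120L);
        long keyCount = 758086L; // keysPerIndex value from the question

        System.out.println("total index bytes: " + totalIndexSize(indexSizes));
        for (Map.Entry<String, Long> e : indexSizes.entrySet()) {
            System.out.printf("%s: %.1f bytes per entry%n",
                    e.getKey(), bytesPerEntry(e.getValue(), keyCount));
        }
    }
}
```

The two index sizes add up to the reported totalIndexSize, and the compound index works out to roughly 135 bytes per file, which is consistent with it holding only filenames and dates rather than photo content.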

4) If you want to have advanced statistics and monitoring, enroll your system in the free MMS monitoring system provided by 10gen. For more information, start here: https://mms.10gen.com/help/

5) Page faults are normal when loading in new data. MongoDB uses memory-mapped files, so every time you write to a new location within the data file, the OS will need to fault in that page.

For more information about memory mapped files, look here: http://docs.mongodb.org/manual/faq/storage/

6) The MongoDB Java driver provides its own connection pool. Unless you're doing a really high-performance application, you're probably best off using the Mongo object as a singleton.
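
A minimal sketch of the singleton-with-Holder approach mentioned in the question is shown below. `PooledClient` is a hypothetical stand-in for the driver's client object (it is not a driver class), and the address is a placeholder; in a real webapp the holder would wrap the actual Mongo instance.

```java
// Hypothetical stand-in for the driver's client object; NOT part of the MongoDB driver.
final class PooledClient {
    private final String host;

    PooledClient(String host) {
        this.host = host;
    }

    String host() {
        return host;
    }
}

// Initialization-on-demand holder idiom: the JVM initializes Holder.INSTANCE
// exactly once, lazily, on the first call to get(), and does so thread-safely.
final class ClientHolder {
    private ClientHolder() {
        // no instances; access goes through get()
    }

    private static final class Holder {
        // Placeholder address, assumed for illustration only.
        static final PooledClient INSTANCE = new PooledClient("localhost:27017");
    }

    static PooledClient get() {
        return Holder.INSTANCE;
    }
}
```

Every request handler would then call `ClientHolder.get()` and share the same object, leaving connection pooling to the driver.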

Upvotes: 4

Aafreen Sheikh

Reputation: 5074

Looks like you have to have _id field in each 'regular' document:

http://www.mongodb.org/display/DOCS/Object+IDs

If you don't specify how it is generated, MongoDB will auto-generate it using the BsonObjectId datatype and also automatically create an index on it, because this field has to be unique. If you don't want to use the default, as in your case, you can put filename + uploadDate in the _id field and let MongoDB handle the index on it.

Also, the 125084624 figure you mention is totalIndexSize, the combined size of both indexes (_id_ plus filename_1_uploadDate_1), not the size of your photos, which is probably much larger. About 125 MB in RAM looks harmless to me.
I don't know how you could better investigate the faults, but I'm assuming you are using a 64-bit build. On a 32-bit build the database size is limited to about 2 GB, and your inserts will start failing at some point before that.

Regarding connections, try testing with a few requests, once with individual connections and once with a singleton. I'm guessing the singleton will perform better. To measure performance or carry out a load test, you might use JMeter:

http://jmeter.apache.org/

Upvotes: 2
