Aleksandar

Reputation: 51

The speed of mongoimport while using --jsonArray is very slow

I have a 15GB file with more than 25 million rows in the following JSON format (which mongodb accepts for importing):

[
    {"_id": 1, "value": "\u041c\..."},
    {"_id": 2, "value": "\u041d\..."},
    ...
]

When I try to import it into mongodb with the following command, I get a speed of only about 50 rows per second, which is really slow for me.

mongoimport --db wordbase --collection sentences --type json --file C:\Users\Aleksandar\PycharmProjects\NLPSeminarska\my_file.json --jsonArray

When I tried to insert the data into the collection using Python with pymongo, the speed was even worse. I also tried increasing the priority of the process, but it didn't make any difference.
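By "inserting with pymongo" I mean roughly the following (a simplified sketch: the localhost connection string, the pymongo 3.x API, and the one-document-per-line layout of the sample above are assumptions):

import json
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["wordbase"]["sentences"]

with open("my_file.json", "r", encoding="utf-8") as f:
    for line in f:
        doc = line.strip().rstrip(",")
        if doc in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        coll.insert_one(json.loads(doc))  # one round-trip per document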

The next thing I tried was the same import but without --jsonArray, and although I got a big speed increase (~4000 docs/sec), it failed with an error saying that the BSON representation of the supplied JSON is too large.

I also tried splitting the file into 5 separate files and importing them from separate consoles into the same collection, but the speed of all of them dropped to about 20 documents/sec.
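The splitting itself was along these lines (again a sketch: it assumes one document per line between the brackets, as in the sample above, and the part file names are made up). Each part is itself a valid JSON array, so each console ran the same mongoimport command against its own part file:

N = 5
outs = [open("part_%d.json" % i, "w", encoding="utf-8") for i in range(N)]
written = [0] * N

# each part file gets its own enclosing brackets
for out in outs:
    out.write("[\n")

with open("my_file.json", "r", encoding="utf-8") as src:
    i = 0
    for line in src:
        doc = line.strip().rstrip(",")
        if doc in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        k = i % N  # deal documents round-robin across the parts
        if written[k]:
            outs[k].write(",\n")
        outs[k].write(doc)
        written[k] += 1
        i += 1

for out in outs:
    out.write("\n]\n")
    out.close()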

While searching all over the web I saw that people had speeds of over 8K documents/sec, and I can't see what I'm doing wrong.

Is there a way to speed this up? Or should I convert the whole JSON file to BSON and import it that way, and if so, what is the correct way to do both the conversion and the import?

Huge thanks.

Upvotes: 4

Views: 1937

Answers (1)

nessa.gp

Reputation: 1864

I had the exact same problem with a 160GB dump file. It took me two days to load 3% of the original file with --jsonArray and 15 minutes with these changes.

First, remove the initial [ and trailing ] characters:

sed -i 's/^\[//; s/\]$//' filename.json

Then import without the --jsonArray option:

mongoimport --db "dbname" --collection "collectionname" --file filename.json

If the file is huge, sed will take a really long time and you may run into storage problems, because sed -i rewrites the file through a temporary copy. You can use this C program instead, which shifts the contents in place (not written by me, all glory to @guillermobox):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    FILE *f;
    const size_t buffersize = 2048;
    size_t length, filesize, position;
    char buffer[buffersize + 1];

    if (argc < 2) {
        fprintf(stderr, "Please provide file to mongofix!\n");
        exit(EXIT_FAILURE);
    }

    f = fopen(argv[1], "r+");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    /* get the full filesize */
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    /* ignore the first character (the opening '[') */
    fseek(f, 1, SEEK_SET);

    while (1) {
        /* read chunks of buffersize size */
        length = fread(buffer, 1, buffersize, f);
        position = ftell(f);

        /* write the same chunk back, one character earlier */
        fseek(f, position - length - 1, SEEK_SET);
        fwrite(buffer, 1, length, f);

        /* return to the reading position */
        fseek(f, position, SEEK_SET);

        /* we have finished when not all the buffer is read */
        if (length != buffersize)
            break;
    }

    /* truncate the file by two characters: the closing ']' and the
       stale byte left behind by the shift */
    ftruncate(fileno(f), filesize - 2);

    fclose(f);

    return 0;
}
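To use it, save it as e.g. mongofix.c, compile it, and run it on the dump (gcc is just an assumption here; any C compiler should do):

gcc -O2 -o mongofix mongofix.c
./mongofix filename.json

Then run the mongoimport command above on the result. Note that the program edits the file in place, so keep a backup if the dump is hard to regenerate.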

P.S.: I don't have the reputation to suggest migrating this question, but I think this could be helpful.

Upvotes: 2
