Yogevnn

Reputation: 1510

Insert huge files (2 GB) into MongoDB

I have 2 GB files (9 of them) containing approximately 12M string records, and I want to insert each record as a document into a local MongoDB instance (on Windows).

Right now I'm reading the file line by line and inserting every second line (the first of each pair is an unnecessary header), like this:

bool readingFlag = false;
foreach (var line in File.ReadLines(file))
{
    if (readingFlag)
    {
        // build the document directly instead of concatenating and
        // re-parsing a JSON string (also avoids quoting problems)
        var document = new BsonDocument("read", line);
        await collection.InsertOneAsync(document);
        readingFlag = false;
    }
    else
    {
        readingFlag = true;
    }
}

This method works, but not as fast as I expected. I'm now in the middle of the first file, and I estimate it will take about 4 hours for just one file (roughly 40 hours for all my data).

I think my bottleneck is the file reading, but since the file is very big I can't load it into memory all at once (out-of-memory exception).

Is there any other approach that I'm missing here?

Upvotes: 1

Views: 2299

Answers (2)

Nishat Mazhar

Reputation: 1

In my experiments I found Parallel.ForEach(File.ReadLines("path")) to be the fastest; the file size was about 42 GB. I also tried batching sets of 100 lines and saving each batch, but that was slower than Parallel.ForEach.
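A minimal sketch of that pattern (not the exact code from my experiments; the temp file and counter are stand-ins, and the commented-out InsertOne shows where the MongoDB call would go in the real run):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ParallelReadDemo
{
    static void Main()
    {
        // stand-in input: in the real scenario this is the 42 GB file
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "a", "b", "c", "d" });

        long inserted = 0;

        // File.ReadLines streams lazily, so the file is never fully in
        // memory; Parallel.ForEach fans the lines out to worker threads.
        Parallel.ForEach(File.ReadLines(path), line =>
        {
            // here each worker would call
            // collection.InsertOne(new BsonDocument("read", line));
            Interlocked.Increment(ref inserted);
        });

        Console.WriteLine(inserted); // prints 4
        File.Delete(path);
    }
}
```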

Another example: Read large txt file multithreaded?

Upvotes: 0

profesor79

Reputation: 9473

I think we could utilize these things:

  1. Read a bunch of lines and insert them together with InsertMany
  2. Insert the data on a separate thread, as we don't need to wait for it to finish
  3. Use a typed class TextData to push serialization to the other thread

You can play with the batch size limit, as it depends on the amount of data read from the file:

public class TextData {
    public ObjectId _id { get; set; }
    public string read { get; set; }
}

public class Processor {
    public async Task ProcessData() {
        var client = new MongoClient("mongodb://localhost:27017");
        var database = client.GetDatabase("test");

        var collection = database.GetCollection<TextData>("Yogevnn");
        var readingFlag = false;
        var listOfDocuments = new List<TextData>();
        var limitAtOnce = 100;
        var current = 0;
        var pendingInserts = new List<Task>();

        foreach (var line in File.ReadLines(@"E:\file.txt")) {
            if (readingFlag) {
                listOfDocuments.Add(new TextData { read = line });
                readingFlag = false;
                Console.WriteLine($"Current position: {current}");

                if (++current == limitAtOnce) {
                    current = 0;
                    Console.WriteLine("Inserting data");
                    // hand the batch off without awaiting, so reading continues
                    pendingInserts.Add(collection.InsertManyAsync(listOfDocuments));
                    listOfDocuments = new List<TextData>();
                }
            } else {
                readingFlag = true;
            }
        }

        // insert the remainder and wait for all outstanding inserts
        if (listOfDocuments.Count > 0) {
            pendingInserts.Add(collection.InsertManyAsync(listOfDocuments));
        }
        await Task.WhenAll(pendingInserts);
    }
}

Any comments welcome!

Upvotes: 1
