Reputation: 1510
I have nine 2 GB files containing approximately 12M records of strings, and I want to insert each record as a document into a local MongoDB instance (Windows).
Right now I'm reading line by line and inserting every second line (the first of each pair is an unnecessary header), like this:
bool readingFlag = false;
foreach (var line in File.ReadLines(file))
{
    if (readingFlag)
    {
        // Build the document directly; concatenating a JSON string breaks
        // as soon as a line contains a quote character.
        var document = new BsonDocument("read", line);
        await collection.InsertOneAsync(document);
        readingFlag = false;
    }
    else
    {
        readingFlag = true;
    }
}
This method works, but not as fast as I expected. I'm now in the middle of the file, and I estimate it will finish in about 4 hours for just one file (40 hours for all my data).
I think my bottleneck is the file reading, but since the file is very big, VS doesn't let me load it into memory (out-of-memory exception).
Is there another way that I'm missing here?
Upvotes: 1
Views: 2299
Reputation: 1
In my experiments I found Parallel.ForEach(File.ReadLines("path")) to be the fastest.
The file size was about 42 GB. I also tried batching sets of 100 lines and saving each batch, but that was slower than Parallel.ForEach.
Another example: Read large txt file multithreaded?
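As a rough illustration of the Parallel.ForEach approach, here is a minimal sketch. The `ParallelReader` class, the `ForEachRecord` method, and the `maxParallelism` parameter are hypothetical names, not from the original post. Because Parallel.ForEach does not process items in order, the header lines are filtered out by index before the work is parallelized. `handleRecord` stands in for whatever saves a record (for example, an insert into MongoDB) and must be thread-safe.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelReader
{
    // Streams the file lazily, keeps every second line (index 1, 3, 5, ...),
    // and hands each kept line to handleRecord on a worker thread.
    public static void ForEachRecord(string path, Action<string> handleRecord, int maxParallelism = 4)
    {
        // Filter by position first: once the lines are processed in parallel,
        // "every second line" can no longer be tracked with a toggle flag.
        var records = File.ReadLines(path).Where((line, index) => index % 2 == 1);

        // Cap the number of workers so the database is not flooded (assumed value).
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        Parallel.ForEach(records, options, handleRecord);
    }
}
```

The order in which records reach `handleRecord` is not guaranteed, so this only fits cases where insertion order does not matter.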
Upvotes: 0
Reputation: 9473
I think we could use a typed class, TextData, to push serialization to another thread, and insert the documents in batches instead of one at a time.
You can play with the limit-at-once value, as the best setting depends on the amount of data read from the file.
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class TextData
{
    public ObjectId _id { get; set; }
    public string read { get; set; }
}

public class Processor
{
    // async Task (not async void) so callers can await completion and see exceptions
    public async Task ProcessData()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var database = client.GetDatabase("test");
        var collection = database.GetCollection<TextData>("Yogevnn");
        var readingFlag = false;
        var listOfDocument = new List<TextData>();
        var pendingInserts = new List<Task>();
        var limitAtOnce = 100;
        foreach (var line in File.ReadLines(@"E:\file.txt"))
        {
            if (readingFlag)
            {
                listOfDocument.Add(new TextData { read = line });
                readingFlag = false;
                if (listOfDocument.Count == limitAtOnce)
                {
                    Console.WriteLine("Inserting data");
                    // start the insert without blocking the reader; keep the task to await later
                    pendingInserts.Add(collection.InsertManyAsync(listOfDocument));
                    listOfDocument = new List<TextData>();
                }
            }
            else
            {
                readingFlag = true;
            }
        }
        // insert remainder (InsertManyAsync rejects an empty list)
        if (listOfDocument.Count > 0)
        {
            pendingInserts.Add(collection.InsertManyAsync(listOfDocument));
        }
        // wait for every batch so failures are not silently lost
        await Task.WhenAll(pendingInserts);
        Console.WriteLine("Inserting data FINISH");
    }
}
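The batching part can also be pulled out of the read loop so it can be checked without a database. This is only a sketch under my own assumptions: `RecordBatcher` and `ReadBatches` are hypothetical names, not part of the driver or of the code above.

```csharp
using System;
using System.Collections.Generic;

public static class RecordBatcher
{
    // Skips the header line of each header/record pair and yields the
    // records in lists of at most batchSize items.
    public static IEnumerable<List<string>> ReadBatches(IEnumerable<string> lines, int batchSize)
    {
        // Iterator methods run lazily, so this check fires on first enumeration.
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<string>(batchSize);
        var isRecord = false; // the first line of each pair is a header
        foreach (var line in lines)
        {
            if (isRecord)
            {
                batch.Add(line);
                if (batch.Count == batchSize)
                {
                    yield return batch;
                    batch = new List<string>(batchSize);
                }
            }
            isRecord = !isRecord;
        }

        // Yield the remainder only if there is one, so callers never see an empty batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

Each batch from `RecordBatcher.ReadBatches(File.ReadLines(path), 100)` could then be mapped to TextData objects and passed to collection.InsertManyAsync, which keeps the file parsing separate from the insert logic.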
Any comments welcome!
Upvotes: 1