Reputation: 1768
I have 200,000 JSON files on the file system.
Deserializing them one by one and adding them to a List takes about 4 minutes.
I am looking for the fastest way to deserialize them, or a way to deserialize them all at once.
Code Sample
The code I am using is something like this:
var files = Directory.GetFiles(@"C:\Data", "*.json");
var list = new List<ParsedData>();
var dt1 = DateTime.Now;
foreach (var file in files)
{
    using (StreamReader filestr = File.OpenText(file))
    {
        JsonSerializer serializer = new JsonSerializer();
        var data = (ParsedData)serializer.Deserialize(filestr, typeof(ParsedData));
        list.Add(data);
    }
}
var dt2 = DateTime.Now;
Console.WriteLine((dt2 - dt1).TotalMilliseconds);
var dt2 = DateTime.Now;
Console.WriteLine((dt2 - dt1).TotalMilliseconds);
JSON format
And a sample JSON file looks like this:
{
    "channel_name": "@channel",
    "message": "",
    "text": "",
    "date": "2015/10/09 12:22:48",
    "views": "83810",
    "forwards": "0",
    "raw_text": "",
    "keywords_marked": "",
    "id": 973,
    "media": "1.jpg"
}
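The ParsedData class is not shown in the question; a minimal version matching this sample might look like the following (a sketch; the C# property names are assumptions, mapped to the JSON keys with Newtonsoft.Json attributes):
using Newtonsoft.Json;

public class ParsedData
{
    // Property names below are assumed; the attributes map them to the JSON keys
    [JsonProperty("channel_name")] public string ChannelName { get; set; }
    [JsonProperty("message")] public string Message { get; set; }
    [JsonProperty("text")] public string Text { get; set; }
    [JsonProperty("date")] public string Date { get; set; }
    [JsonProperty("views")] public string Views { get; set; }
    [JsonProperty("forwards")] public string Forwards { get; set; }
    [JsonProperty("raw_text")] public string RawText { get; set; }
    [JsonProperty("keywords_marked")] public string KeywordsMarked { get; set; }
    [JsonProperty("id")] public int Id { get; set; }
    [JsonProperty("media")] public string Media { get; set; }
}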
Upvotes: 1
Views: 962
Reputation: 856
You can try using Parallel.ForEach():
using System.Collections.Concurrent;
using System.Text.Json;

var files = Directory.GetFiles(@"C:\Data", "*.json");
var list = new ConcurrentBag<ParsedData>(); // thread-safe collection for concurrent adds
var dt1 = DateTime.Now;
Parallel.ForEach(files, file =>
{
    var filestr = File.ReadAllText(file);
    var data = JsonSerializer.Deserialize<ParsedData>(filestr);
    list.Add(data);
});
var dt2 = DateTime.Now;
Console.WriteLine((dt2 - dt1).TotalMilliseconds);
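Note that this snippet uses System.Text.Json's static JsonSerializer, while the question's code uses Newtonsoft.Json's instance-based JsonSerializer. If you want to stay on Newtonsoft.Json, a roughly equivalent parallel version might look like this (a sketch, assuming the same ParsedData type):
using System.Collections.Concurrent;
using Newtonsoft.Json;

var list = new ConcurrentBag<ParsedData>();
Parallel.ForEach(Directory.GetFiles(@"C:\Data", "*.json"), file =>
{
    // JsonConvert.DeserializeObject is Newtonsoft.Json's one-call deserialization API
    var data = JsonConvert.DeserializeObject<ParsedData>(File.ReadAllText(file));
    list.Add(data);
});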
EDIT:
Remove var files = Directory.GetFiles(@"C:\Data", "*.json"); and enumerate the files lazily instead. Directory.EnumerateFiles starts yielding file names before the whole directory listing has been read, whereas GetFiles builds the complete array first:
Parallel.ForEach(Directory.EnumerateFiles(@"C:\Data", "*.json"), file =>
{
    var filestr = File.ReadAllText(file);
    var data = JsonSerializer.Deserialize<ParsedData>(filestr);
    list.Add(data);
});
But with 200,000 files, 50 seconds seems pretty decent.
If you use .NET 6, you can use Parallel.ForEachAsync:
await Parallel.ForEachAsync( ... , async (file, ct) =>
{
    // await using disposes the stream once deserialization completes
    await using var fs = new FileStream(file, FileMode.Open);
    var data = await JsonSerializer.DeserializeAsync<ParsedData>(fs, cancellationToken: ct);
    list.Add(data);
});
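A complete version of that call might look like the following (a sketch; it assumes the files are still enumerated from the same folder and collected into a ConcurrentBag as above):
using System.Collections.Concurrent;
using System.Text.Json;

var list = new ConcurrentBag<ParsedData>();
await Parallel.ForEachAsync(
    Directory.EnumerateFiles(@"C:\Data", "*.json"),
    async (file, ct) =>
    {
        // useAsync: true requests true asynchronous I/O from the OS
        await using var fs = new FileStream(file, FileMode.Open, FileAccess.Read,
            FileShare.Read, bufferSize: 4096, useAsync: true);
        var data = await JsonSerializer.DeserializeAsync<ParsedData>(fs, cancellationToken: ct);
        list.Add(data);
    });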
Upvotes: 2