Hossein

Reputation: 1768

C# batch deserialization of many JSON files

I have 200,000 JSON files on the file system.

Deserializing them one by one and adding them to a List takes about 4 minutes.

I am looking for the fastest way to deserialize them, or a way to deserialize them all at once.

Code Sample

The code I am using is something like this:

    var files = Directory.GetFiles(@"C:\Data", "*.json");
    var list = new List<ParsedData>();
    var dt1 = DateTime.Now;
    foreach (var file in files)
    {
        using (StreamReader filestr = File.OpenText(file))
        {
            // Newtonsoft.Json (Json.NET) serializer
            JsonSerializer serializer = new JsonSerializer();
            var data = (ParsedData)serializer.Deserialize(filestr, typeof(ParsedData));
            list.Add(data);
        }
    }
    var dt2 = DateTime.Now;

    Console.WriteLine((dt2 - dt1).TotalMilliseconds);

JSON format

A sample JSON file looks like this:

{
  "channel_name": "@channel",
  "message": "",
  "text": "",
  "date": "2015/10/09 12:22:48",
  "views": "83810",
  "forwards": "0",
  "raw_text": "",
  "keywords_marked": "",
  "id": 973,
  "media": "1.jpg"
}
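
For reference, a ParsedData class matching this JSON could look roughly like the following sketch (an assumption, since the actual class is not shown here), with Newtonsoft.Json's [JsonProperty] mapping the snake_case names:

    using Newtonsoft.Json;

    public class ParsedData
    {
        [JsonProperty("channel_name")]
        public string ChannelName { get; set; }

        [JsonProperty("message")]
        public string Message { get; set; }

        [JsonProperty("text")]
        public string Text { get; set; }

        [JsonProperty("date")]
        public string Date { get; set; }

        [JsonProperty("views")]
        public string Views { get; set; }

        [JsonProperty("forwards")]
        public string Forwards { get; set; }

        [JsonProperty("raw_text")]
        public string RawText { get; set; }

        [JsonProperty("keywords_marked")]
        public string KeywordsMarked { get; set; }

        [JsonProperty("id")]
        public int Id { get; set; }

        [JsonProperty("media")]
        public string Media { get; set; }
    }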

Upvotes: 1

Views: 962

Answers (1)

Hazrelle

Reputation: 856

You can try using Parallel.ForEach():

            var files = Directory.GetFiles(@"C:\Data", "*.json");
            var list = new ConcurrentBag<ParsedData>();
            var dt1 = DateTime.Now;
            Parallel.ForEach(files, (file) =>
            {
                var filestr = File.ReadAllText(file);
                var data = JsonSerializer.Deserialize<ParsedData>(filestr);
                list.Add(data);
            });
            var dt2 = DateTime.Now;

            Console.WriteLine((dt2 - dt1).TotalMilliseconds);

EDIT: Remove var files = Directory.GetFiles(@"C:\Data", "*.json"); and enumerate the files lazily instead:

            Parallel.ForEach(Directory.EnumerateFiles(@"C:\Data", "*.json"), (file) =>
            {
                var filestr = File.ReadAllText(file);
                var data = JsonSerializer.Deserialize<ParsedData>(filestr);
                list.Add(data);
            });

But with 200,000 files, 50 seconds seems pretty decent.

If you use .NET 6, you can use Parallel.ForEachAsync:

    await Parallel.ForEachAsync(Directory.EnumerateFiles(@"C:\Data", "*.json"), async (file, ct) =>
    {
        await using var fs = File.OpenRead(file);
        var data = await JsonSerializer.DeserializeAsync<ParsedData>(fs, cancellationToken: ct);
        list.Add(data);
    });
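
If the ConcurrentBag is only there to collect results, a PLINQ query is another option; the following is just a sketch, assuming the same ParsedData type and System.Text.Json as above, not a measured comparison:

    // Alternative sketch: PLINQ instead of Parallel.ForEach + ConcurrentBag
    // (assumes the same ParsedData type and System.Text.Json as above).
    using System.IO;
    using System.Linq;
    using System.Text.Json;

    var list = Directory.EnumerateFiles(@"C:\Data", "*.json")
        .AsParallel()
        .Select(file => JsonSerializer.Deserialize<ParsedData>(File.ReadAllText(file)))
        .ToList();

Either way the work is largely disk-bound, so the gains over the Parallel.ForEach version may be small.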

Upvotes: 2
