JakeD

Reputation: 457

Deserializing many Json files using Newtonsoft ToObject

Is there anything terribly inefficient here? It seems like this process is taking way longer than it should. I am parsing many JSON files, each containing a JSON array of objects. Maybe someone with more experience could point out an error in this method of parsing the JSON into objects, thereby saving me a ton of time.

Also, memory usage slowly creeps upward MB by MB, sometimes causing OutOfMemoryExceptions.

public void Parse()
{
    using (BabysFirstsUsersDataEntities db = new BabysFirstsUsersDataEntities())
    {
        foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\").OrderBy(f => f))
        {
            string jsonString = System.IO.File.ReadAllText(path);
            JToken tok = JObject.Parse(jsonString);
            Debug.WriteLine("path: " + path);

            foreach (var x in tok.First.First)
            {
                JsonUserImageDTO jdto = x.ToObject<JsonUserImageDTO>();
                UserImageList uil = jdto.ToDataModel();

                if (uil.REID != null)
                    db.UserImageLists.Add(uil);
            }
        }
        db.SaveChanges();
    }
}

An example of what one of the JSON strings in each .json file looks like is shown below. Note that there are around 1000 of these files and each can have thousands of such entries:

{
  "results": [
    {
      "ACL": {
        "asdf": {
          "read": true,
          "write": true
        },
        "role:admin": { "read": true }
      },
      "REID": "exampleID",
      "createdAt": "datetime-string",
      "email": "example",
      "objectId": "example",
      "updatedAt": "datetimestring",
      "urlCount": 1,
      "urlList": [ "exampleurl" ]
    },
    {
      "ACL": {
        "asdf": {
          "read": true,
          "write": true
        },
        "role:admin": { "read": true }
      },
      "REID": "exampleID",
      "createdAt": "datetime-string",
      "email": "example",
      "objectId": "example",
      "updatedAt": "datetimestring",
      "urlCount": 1,
      "urlList": [ "exampleurl" ]
    }
  ]
}

Upvotes: 0

Views: 525

Answers (3)

dbc

Reputation: 117284

Have you actually profiled your code? See Eric Lippert's performance rant: use a profiler or other analysis tool to determine empirically where the bottleneck is before you start investigating alternatives. For instance, your actual performance problem may be somewhere in the BabysFirstsUsersDataEntities db class.

That being said, my immediate reaction is that you have too many intermediate representations of your data, the construction, population and garbage collection of which all take time. These include:

  • The jsonString which may be large enough to go on the large object heap, and thus permanently impair the performance and memory use of your process.
  • The JToken tok representation of your entire JSON hierarchy.
  • Each individual JsonUserImageDTO.

What I would suggest is to eliminate as many of these intermediate representations as possible. As suggested in the documentation, you should load directly from a stream rather than loading to a string and parsing that string.
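
For instance, a minimal sketch of the docs' stream-based loading, which replaces ReadAllText() plus JObject.Parse() so the large string is never materialized:

// Minimal sketch: read the file through a JsonTextReader instead of a string.
using (var streamReader = File.OpenText(path))
using (var jsonReader = new JsonTextReader(streamReader))
{
    JToken tok = JObject.Load(jsonReader);
    // ... same per-entry processing as before ...
}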

You can also eliminate the JToken tok by populating your data model directly. Let's say your BabysFirstsUsersDataEntities looks like this (I'm just guessing here):

public class BabysFirstsUsersDataEntities
{
    public BabysFirstsUsersDataEntities() { this.UserImageLists = new List<UserImageList>(); }

    public List<UserImageList> UserImageLists { get; set; }
}

public class UserImageList
{
    public string email { get; set; }
    public List<string> urlList;
}

And your DTO model looks something like this model provided by http://json2csharp.com/:

public class RootObjectDTO
{
    public ICollection<JsonUserImageDTO> results { get; set; }
}

public class JsonUserImageDTO
{
    public ACL ACL { get; set; }
    public string REID { get; set; }
    public string createdAt { get; set; }
    public string email { get; set; }
    public string objectId { get; set; }
    public string updatedAt { get; set; }
    public int urlCount { get; set; }
    public List<string> urlList { get; set; }

    public UserImageList ToDataModel()
    {
        return new UserImageList { email = email, urlList = urlList };
    }
}

public class Asdf
{
    public bool read { get; set; }
    public bool write { get; set; }
}

public class RoleAdmin
{
    public bool read { get; set; }
}

public class ACL
{
    public Asdf asdf { get; set; }

    [JsonProperty("role:admin")]
    public RoleAdmin RoleAdmin { get; set; }
}

Then create the following generic ConvertingCollection<TIn, TOut> utility class:

public class ConvertingCollection<TIn, TOut> : BaseConvertingCollection<TIn, TOut, ICollection<TIn>>
{
    readonly Func<TOut, TIn> toInner;

    public ConvertingCollection(Func<ICollection<TIn>> getCollection, Func<TIn, TOut> toOuter, Func<TOut, TIn> toInner)
        : base(getCollection, toOuter)
    {
        if (toInner == null)
            throw new ArgumentNullException();
        this.toInner = toInner;
    }

    protected TIn ToInner(TOut outer) { return toInner(outer); }

    public override void Add(TOut item)
    {
        Collection.Add(ToInner(item));
    }

    public override void Clear()
    {
        Collection.Clear();
    }

    public override bool IsReadOnly { get { return Collection.IsReadOnly; } }

    public override bool Remove(TOut item)
    {
        return Collection.Remove(ToInner(item));
    }

    public override bool Contains(TOut item)
    {
        return Collection.Contains(ToInner(item));
    }
}

public abstract class BaseConvertingCollection<TIn, TOut, TCollection> : ICollection<TOut>
where TCollection : ICollection<TIn>
{
    readonly Func<TCollection> getCollection;
    readonly Func<TIn, TOut> toOuter;

    public BaseConvertingCollection(Func<TCollection> getCollection, Func<TIn, TOut> toOuter)
    {
        if (getCollection == null || toOuter == null)
            throw new ArgumentNullException();
        this.getCollection = getCollection;
        this.toOuter = toOuter;
    }

    protected TCollection Collection { get { return getCollection(); } }

    protected TOut ToOuter(TIn inner) { return toOuter(inner); }

    #region ICollection<TOut> Members

    public abstract void Add(TOut item);

    public abstract void Clear();

    public virtual bool Contains(TOut item)
    {
        var comparer = EqualityComparer<TOut>.Default;
        foreach (var member in Collection)
            if (comparer.Equals(item, ToOuter(member)))
                return true;
        return false;
    }

    public void CopyTo(TOut[] array, int arrayIndex)
    {
        foreach (var item in this)
            array[arrayIndex++] = item;
    }

    public int Count { get { return Collection.Count; } }

    public abstract bool IsReadOnly { get; }

    public abstract bool Remove(TOut item);

    #endregion

    #region IEnumerable<TOut> Members

    public IEnumerator<TOut> GetEnumerator()
    {
        foreach (var item in Collection)
            yield return ToOuter(item);
    }

    #endregion

    #region IEnumerable Members

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    #endregion
}

You can now populate your db directly as follows:

var rootDTO = new RootObjectDTO
{
    results = new ConvertingCollection<UserImageList, JsonUserImageDTO>(
        () => db.UserImageLists,
        (x) => { throw new NotImplementedException(); },
        (x) => x.ToDataModel())
};

using (var stream = File.Open(path, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    JsonSerializer.CreateDefault().Populate(reader, rootDTO);
}

By populating a preallocated rootDTO and ConvertingCollection<UserImageList, JsonUserImageDTO>, your db.UserImageLists will get populated with the contents of the JSON with fewer intermediate representations.
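
For reference, a sketch of how the question's Parse() might look under this approach. Note the REID filter from the original code is not shown here; it would need to move into ToDataModel() or into a pass before SaveChanges():

public void Parse()
{
    using (var db = new BabysFirstsUsersDataEntities())
    {
        // Preallocate once; Populate() appends to the existing collection on each call.
        var rootDTO = new RootObjectDTO
        {
            results = new ConvertingCollection<UserImageList, JsonUserImageDTO>(
                () => db.UserImageLists,
                (x) => { throw new NotImplementedException(); },
                (x) => x.ToDataModel())
        };

        foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\").OrderBy(f => f))
        {
            using (var stream = File.Open(path, FileMode.Open))
            using (var reader = new StreamReader(stream))
            {
                JsonSerializer.CreateDefault().Populate(reader, rootDTO);
            }
        }

        db.SaveChanges();
    }
}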

Upvotes: 1

Mike Hixson

Reputation: 5199

It looks like there are several places that could be causing the slowness.

  1. Deserializing JSON
  2. Transforming the object twice (jdto, then uil)
  3. Saving to the database

It may be worth profiling the code to find out exactly what part is taking longer than you'd expect. That said, there are some things you can do to generally improve this code.

  1. Deserialize from a stream instead of a string. The way you have it, you basically have the object in memory twice: once as a string, then once as tok. See the second example in the docs for how to use a stream. Actually, in your case you have the same information in memory four times: the string, tok, jdto, and uil. Which brings me to the next point.
  2. Try to eliminate some of the intermediate representations of your object. Generally, the more objects you have laying around, the more time you will spend waiting on the GC.
  3. Move the filtering on the path name to the part where you call EnumerateFiles(). There is no sense in reading and deserializing a file if you are not going to do anything with it. (See the sketch after this list.)
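
For instance, points 1 and 3 might combine into something like the following sketch. It borrows the RootObjectDTO / JsonUserImageDTO model from the answer above; the "*.json" search pattern and the db variable are assumptions carried over from the question:

// Sketch only: filter in EnumerateFiles, read from a stream, skip the JToken step.
var serializer = JsonSerializer.CreateDefault();

foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\", "*.json").OrderBy(f => f))
{
    using (var streamReader = File.OpenText(path))
    using (var jsonReader = new JsonTextReader(streamReader))
    {
        var root = serializer.Deserialize<RootObjectDTO>(jsonReader);

        foreach (var dto in root.results)
        {
            // REID filter carried over from the question
            if (dto.REID != null)
                db.UserImageLists.Add(dto.ToDataModel());
        }
    }
}
db.SaveChanges();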

Upvotes: 1

Marcus Urbano

Reputation: 141

You can create classes for the objects and then deserialize into them. Example:

JsonConvert.DeserializeObject<RootObject>(jsonString);

public class Asdf
{
    public bool read { get; set; }
    public bool write { get; set; }
}

public class RoleAdmin
{
    public bool read { get; set; }
}

public class ACL
{
    public Asdf asdf { get; set; }
    [JsonProperty("role:admin")]
    public RoleAdmin RoleAdmin { get; set; }
}

public class Result
{
    public ACL ACL { get; set; }
    public string REID { get; set; }
    public string createdAt { get; set; }
    public string email { get; set; }
    public string objectId { get; set; }
    public string updatedAt { get; set; }
    public int urlCount { get; set; }
    public List<string> urlList { get; set; }
}

public class RootObject
{
    public List<Result> results { get; set; }
}
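
In the question's loop this could be used roughly as follows. A sketch only; the UserImageList mapping is guessed from the model in the first answer, since the real data model isn't shown:

foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\").OrderBy(f => f))
{
    string jsonString = System.IO.File.ReadAllText(path);
    RootObject root = JsonConvert.DeserializeObject<RootObject>(jsonString);

    foreach (Result result in root.results)
    {
        // REID filter carried over from the question; mapping is hypothetical.
        if (result.REID != null)
            db.UserImageLists.Add(new UserImageList { email = result.email, urlList = result.urlList });
    }
}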

Upvotes: 0
