Tlacaelel Ramon Luis

Reputation: 472

Best way to manipulate large json objects

We have an application that calls an API every 4 hours and gets a dump of all objects, returned in JSON format, which is then stored in a file.json

The reason we do this is that we need up-to-date data, we are not allowed to use the API directly to fetch small portions of this data, and we also need to do a clean-up on it.

There is also another problem: we can't request only the updated records (which is actually what we need).

The way we currently handle this is by fetching the data, storing it in a file, loading the previous file into memory, and comparing the values to find only the new and updated records. Once we have those, we insert them into MySQL.
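
In PHP terms the comparison step looks roughly like this (a simplified sketch; "id" stands in for the real key field, and both dumps are assumed to fit in memory, as they do today):

    <?php
    // Simplified sketch of the diff step; "id" is a stand-in for the real key
    // field, and both dumps are assumed to fit in memory.
    $previous = json_decode(file_get_contents('previous.json'), true) ?: [];
    $current  = json_decode(file_get_contents('file.json'), true) ?: [];

    // Index the previous dump by id for fast lookups.
    $previousById = [];
    foreach ($previous as $record) {
        $previousById[$record['id']] = $record;
    }

    // Keep only records that are new or whose contents changed.
    $changed = [];
    foreach ($current as $record) {
        $old = isset($previousById[$record['id']]) ? $previousById[$record['id']] : null;
        if ($old === null || $old != $record) {   // != compares array contents
            $changed[] = $record;
        }
    }

    // $changed is what gets inserted/updated in MySQL.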

I am currently looking into a different option: since the new file will contain every single record, why not query the needed objects from the file.json when needed?

The problem with that is that some of these files are larger than 50 MB (each file contains one of the related tables, six files in total making up the full relation), and we can't be loading them into memory every time there is a query. Does anyone know of a DB system that allows querying on a file, or an easier way to replace the old data with the new data in a quick operation?

Upvotes: 1

Views: 5272

Answers (2)

Tlacaelel Ramon Luis

Reputation: 472

I ended up doing my own processing method.

I take the JSON dump of all records and process it into individual files, each one holding a record together with all of its related records, kind of like a join. To keep lookups on these files from getting slow, I created multiple subfolders, each holding a block of records. While writing these files I also build a tiny index file that points to the directory location of each record.

Now, every time there is a query, I just load the index file into memory (it is under 1 MB) and check whether the index key, which is the master key of the record, exists. If it does, I have the location of the file, which I then load into memory; it contains all the information the application needs.
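
The lookup side is roughly the following (simplified, with hypothetical paths and key names):

    <?php
    // Simplified lookup. The index file is a small JSON map of
    // master key => relative record file path.
    function lookupRecord($masterKey)
    {
        static $index = null;
        if ($index === null) {
            // The index is under 1 MB, so keep it cached after the first load.
            $index = json_decode(file_get_contents('/data/index.json'), true);
        }

        if (!isset($index[$masterKey])) {
            return null;                              // unknown record
        }

        // e.g. "records/0042/12345.json" -- one file per record, grouped
        // into subfolders so no single directory grows too large.
        return json_decode(file_get_contents('/data/' . $index[$masterKey]), true);
    }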

Querying these files ended up being a lot faster than querying the DB, which works for what we need.

Thank you all for your input as it helped me decide which way to go.

Upvotes: 1

user473305

Reputation:

I think the approach you're using already is probably the most practical, but I'm intrigued by your idea of searching the JSON file directly.

Here's how I'd take a stab at implementing this, having worked on a Web application that used a similar approach of searching an XML file on disk rather than a database (and, remarkably, was still fast enough for production use):

  • Sort the JSON data first. Creating a new master file with the objects reordered to match how they're indexed in the database will maximize the efficiency of a linear search through the data.

  • Use a streaming JSON parser for searches. This will allow the file to be parsed object-by-object without needing to load the entire document in memory first. If the file is sorted, only half the document on average will need to be parsed for each lookup.

    Streaming JSON parsers are rare, but they exist. Salsify has created one for PHP.

  • Benchmark searching the file directly using the above two strategies. You may discover this is enough to make the application usable, especially if it supports only a small number of users. If not:

  • Build separate indices on disk. Instead of having the application search the entire JSON file directly, parse it once when it's received and create one or more index files that associate key values with byte offsets into the original file. The application can then search a (much smaller) index file for the object it needs; once it retrieves the matching offset, it can seek immediately to the corresponding JSON object in the master file and parse it directly. (A rough sketch of this appears after the list.)

  • Consider using a more efficient data format. JSON is lightweight, but there may be better options. You might experiment with generating a new master file using serialize to output a "frozen" representation of each parsed JSON object in PHP's native serialization format. The application can then use unserialize to obtain an array or object it can use immediately.

    Combining this with the use of index files, especially if they're generated as trees rather than lists, will probably give you about the best performance you can hope for from a simple, purely filesystem-based solution.
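
To make the index-file idea a bit more concrete, here's a rough PHP sketch of what it could look like, assuming the master dump were first rewritten with one JSON object per line (NDJSON) so individual records can be seeked to and parsed on their own; the "id" key and the file names are just placeholders:

    <?php
    // Build the index once per dump: id => byte offset into the master file.
    function buildIndex($masterFile, $indexFile)
    {
        $in = fopen($masterFile, 'rb');
        $index = array();
        while (($offset = ftell($in)) !== false && ($line = fgets($in)) !== false) {
            $record = json_decode($line, true);
            if ($record !== null) {
                $index[$record['id']] = $offset;
            }
        }
        fclose($in);
        // serialize() gives a PHP-native "frozen" index that is cheap to reload.
        file_put_contents($indexFile, serialize($index));
    }

    // At query time, load only the small index and seek straight to the record.
    function findRecord($masterFile, $indexFile, $id)
    {
        $index = unserialize(file_get_contents($indexFile));
        if (!isset($index[$id])) {
            return null;
        }
        $in = fopen($masterFile, 'rb');
        fseek($in, $index[$id]);
        $record = json_decode(fgets($in), true);
        fclose($in);
        return $record;
    }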

Upvotes: 3
