Tlacaelel Ramon Luis

Reputation: 472

Best way to manipulate large json objects

We have an application that calls an API every 4 hours and gets a dump of all objects, returned in JSON format, which is then stored in a file.json

The reason we do this is that we need up-to-date data, we are not allowed to use the API directly to fetch small portions of this data, and we also need to do a clean-up on it.

There is also another problem: we can't request only the updated records (which is actually what we need).

The way we currently handle this is by fetching the data, storing it in a file, loading the previous file into memory, and comparing the values to find only the new and updated records. Once we have those, we insert them into MySQL.
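
In PHP terms the comparison step looks roughly like this (a simplified sketch; "id" stands in for the real key field, and both dumps are assumed to fit in memory, as they do today):

    <?php
    // Simplified sketch of the diff step; "id" is a stand-in for the real key
    // field, and both dumps are assumed to fit in memory.
    $previous = json_decode(file_get_contents('previous.json'), true) ?: [];
    $current  = json_decode(file_get_contents('file.json'), true) ?: [];

    // Index the previous dump by id for fast lookups.
    $previousById = [];
    foreach ($previous as $record) {
        $previousById[$record['id']] = $record;
    }

    // Keep only records that are new or whose contents changed.
    $changed = [];
    foreach ($current as $record) {
        $old = isset($previousById[$record['id']]) ? $previousById[$record['id']] : null;
        if ($old === null || $old != $record) {   // != compares array contents
            $changed[] = $record;
        }
    }

    // $changed is what gets inserted/updated in MySQL.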

I am currently looking into a different option: since the new file will contain every single record, why not query the needed objects from the file.json when needed?

The problem with that is that some of these files are larger than 50 MB (each file contains one of the related tables, six files in total making up the full relation), and we can't be loading them into memory every time there is a query. Does anyone know of a DB system that allows querying on a file, or an easier way to replace the old data with the new data in a quick operation?

Upvotes: 1

Views: 5272

Answers (2)

Tlacaelel Ramon Luis

Reputation: 472

I ended up doing my own processing method.

I take the JSON dump of all records and process it into individual files, each one holding a record together with all of its related records, kind of like a join. To keep lookups on these files from getting slow, I created multiple subfolders, each holding a block of records. While writing these files I also build a tiny index file that points to the directory location of each record.

Now, every time there is a query, I just load the index file into memory (it is under 1 MB) and check whether the index key, which is the master key of the record, exists. If it does, I have the location of the file, which I then load into memory; it contains all the information the application needs.
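
The lookup side is roughly the following (simplified, with hypothetical paths and key names):

    <?php
    // Simplified lookup. The index file is a small JSON map of
    // master key => relative record file path.
    function lookupRecord($masterKey)
    {
        static $index = null;
        if ($index === null) {
            // The index is under 1 MB, so keep it cached after the first load.
            $index = json_decode(file_get_contents('/data/index.json'), true);
        }

        if (!isset($index[$masterKey])) {
            return null;                              // unknown record
        }

        // e.g. "records/0042/12345.json" -- one file per record, grouped
        // into subfolders so no single directory grows too large.
        return json_decode(file_get_contents('/data/' . $index[$masterKey]), true);
    }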

Querying these files ended up being a lot faster than querying the DB, which works for what we need.

Thank you all for your input as it helped me decide which way to go.

Upvotes: 1

user473305

Reputation:

I think the approach you're using already is probably the most practical, but I'm intrigued by your idea of searching the JSON file directly.

Here's how I'd take a stab at implementing this, having worked on a Web application that used a similar approach of searching an XML file on disk rather than a database (and, remarkably, was still fast enough for production use):

  • Sort the JSON data first. Creating a new master file with the objects reordered to match how they're indexed in the database will maximize the efficiency of a linear search through the data.

  • Use a streaming JSON parser for searches. This will allow the file to be parsed object-by-object without needing to load the entire document in memory first. If the file is sorted, only half the document on average will need to be parsed for each lookup.

    Streaming JSON parsers are rare, but they exist. Salsify has created one for PHP.

  • Benchmark searching the file directly using the above two strategies. You may discover this is enough to make the application usable, especially if it supports only a small number of users. If not:

  • Build separate indices on disk. Instead of having the application search the entire JSON file directly, parse it once when it's received and create one or more index files that associate key values with byte offsets into the original file. The application can then search a (much smaller) index file for the object it needs; once it retrieves the matching offset, it can seek immediately to the corresponding JSON object in the master file and parse it directly. (A rough sketch of this appears after the list.)

  • Consider using a more efficient data format. JSON is lightweight, but there may be better options. You might experiment with generating a new master file using serialize to output a "frozen" representation of each parsed JSON object in PHP's native serialization format. The application can then use unserialize to obtain an array or object it can use immediately.

    Combining this with the use of index files, especially if they're generated as trees rather than lists, will probably give you about the best performance you can hope for from a simple, purely filesystem-based solution.
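
To make the index-file idea a bit more concrete, here's a rough PHP sketch of what it could look like, assuming the master dump were first rewritten with one JSON object per line (NDJSON) so individual records can be seeked to and parsed on their own; the "id" key and the file names are just placeholders:

    <?php
    // Build the index once per dump: id => byte offset into the master file.
    function buildIndex($masterFile, $indexFile)
    {
        $in = fopen($masterFile, 'rb');
        $index = array();
        while (($offset = ftell($in)) !== false && ($line = fgets($in)) !== false) {
            $record = json_decode($line, true);
            if ($record !== null) {
                $index[$record['id']] = $offset;
            }
        }
        fclose($in);
        // serialize() gives a PHP-native "frozen" index that is cheap to reload.
        file_put_contents($indexFile, serialize($index));
    }

    // At query time, load only the small index and seek straight to the record.
    function findRecord($masterFile, $indexFile, $id)
    {
        $index = unserialize(file_get_contents($indexFile));
        if (!isset($index[$id])) {
            return null;
        }
        $in = fopen($masterFile, 'rb');
        fseek($in, $index[$id]);
        $record = json_decode(fgets($in), true);
        fclose($in);
        return $record;
    }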

Upvotes: 3
