Fadi

Reputation: 2370

Reading big arrays from a big JSON file in PHP

I know my question has a lot of answers on the internet, but it seems I can't find a good one, so I will try to explain what I have and hope for the best.

What I'm trying to do is read a big JSON file that might have a more complex structure (nested objects with big arrays) than this, but here is a simple example:

{
  "data": {
    "time": [
      1,
      2,
      3,
      4,
      5,
       ...
    ],
    "values": [
      1,
      2,
      3,
      4,
      6,
       ...
    ]
  }
}

This file might be 200 MB or more, and I'm using file_get_contents() and json_decode() to read the data from the file.

Then I put the result in a variable, loop over the time array, and use the current index to get the corresponding entry from the values array, then save the time and the value in the database. But this takes a lot of CPU and memory. Is there a better way to do this?

Are there better functions to use, a better JSON structure, or maybe a better data format than JSON for this?

My code:

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

foreach ($data["data"]["time"] as $timeIndex => $timeValue) {
    saveInDataBase($timeValue, $data["data"]["values"][$timeIndex]);
}

Thanks in advance for any help.

Update 06/29/2020:

Here is another, more complex JSON structure example:

{
  "data": {
    "set_1": {
      "sub_set_1": {
        "info_1": {
          "details_1": {
            "data_1": [1,2,3,4,5,...],
            "data_2": [1,2,3,4,5,...],
            "data_3": [1,2,3,4,5,...],
            "data_4": [1,2,3,4,5,...],
            "data_5": 10254552
          },
          "details_2": [
            [1,2,3,4,5,...],
            [1,2,3,4,5,...],
            [1,2,3,4,5,...]
          ]
        },
        "info_2": {
          "details_1": {
            "data_1": {
              "arr_1": [1,2,3,4,5,...],
              "arr_2": [1,2,3,4,5,...]
            },
            "data_2": {
              "arr_1": [1,2,3,4,5,...],
              "arr_2": [1,2,3,4,5,...]
            },
            "data_5": {
              "text": "some text"
            }
          },
          "details_2": [1,2,3,4,5,...]
        }
      }, ...
    }, ...
  }
}

The file size might be around 500 MB or more, and the arrays inside this JSON file might hold around 100 MB of data or more.

My question is: how can I get any piece of this data and navigate between its nodes in the most efficient way, one that won't take much RAM and CPU? I can't read the file line by line, because I need to be able to get any piece of the data when I have to.

Is Python, for example, more suitable for handling this big data more efficiently than PHP?

If you can provide a detailed answer, I think it will be a great help for everyone looking to do this kind of big-data work with PHP.

Upvotes: 10

Views: 10742

Answers (7)

Max Kaps 4bis.nl

Reputation: 1401

JSON is a great format and a way better alternative to XML. In the end, JSON is almost one-to-one convertible to XML and back.

Big files can get bigger, so we don't want to read all of them into memory and we don't want to parse the whole file at once. I had the same issue with XXL-size JSON files.

I think the issue lies not in a specific programming language, but in the implementation and the specifics of the formats.

I have 3 solutions for you:

  1. Native PHP implementation (preferred)

There is a library, https://github.com/pcrov/JsonReader, that is almost as fast as a streamed XMLReader. Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

// read() advances the cursor node by node; passing a name skips ahead to
// the next node with that name without loading the rest of the document.
while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();

This library will not read the whole file into memory or parse all the lines. It traverses the tree of the JSON document step by step, on command.
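Applied to the structure in the question, a streaming pass might look roughly like the sketch below. It assumes the "time" array appears before "values" in the file and reuses the saveInDataBase() helper from the question; only the plain time scalars are buffered, which is far smaller than the fully decoded document.

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open(storage_path("test/ts/ts_big_data.json"));

// Collect the "time" elements one scalar at a time instead of decoding
// the whole document.
$reader->read("time");              // position the cursor on the "time" array
$arrayDepth = $reader->depth();
$times = [];
while ($reader->read() && $reader->depth() > $arrayDepth) {
    if ($reader->type() === JsonReader::NUMBER) {
        $times[] = $reader->value();
    }
}

// Then stream the "values" array and pair each element with its time by index.
$reader->read("values");
$arrayDepth = $reader->depth();
$i = 0;
while ($reader->read() && $reader->depth() > $arrayDepth) {
    if ($reader->type() === JsonReader::NUMBER) {
        saveInDataBase($times[$i++], $reader->value());
    }
}

$reader->close();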

  2. Let go of the format (cons: multiple conversions)

Preprocess the file into a different format like XML or CSV. There are very lightweight Node.js libs, like https://www.npmjs.com/package/json2csv, for converting JSON to CSV.

  3. Use some NoSQL DB (cons: additional complex software to install and maintain)

For example Redis or CouchDB (you can import a JSON file into CouchDB).

Upvotes: 10

Eve

Reputation: 377

Your problem is basically related to the memory management performed by whichever programming language you use to access the data from a huge (storage-purpose) file.

For example, when you pile the operations together in a single statement, as in the code you mentioned (below),

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

what happens is that the memory used by the Zend runtime engine grows considerably, because it has to allocate memory for every intermediate result involved in that statement - not only the file being opened, but also the references to it - until those buffers are finally released (freed) again. It's no wonder that when you chain the file_get_contents() call, which reads the file into a string, with json_decode(), you force the interpreter to keep all three "things" in memory: the file itself, the string created from it, and the decoded structure.

On the contrary, if you break the statement into several ones, the memory held by the first data structure (the file contents) can be unloaded once the operation of "getting its content" and writing it into another variable (or file) has fully completed. As long as you don't define a variable in which to save the data, it will still stay in memory (as a blob - with no name, no storage address, just content). For this reason, when working with big data, it is much more CPU- and RAM-effective to break everything into smaller steps.

So start by simply rewriting your code as follows:

$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));

$data = json_decode($somefile, true);

When the first line has executed, the memory held for ts_big_data.json can be released (think of it as being purged and made available again to other processes).

When the second line gets executed, $somefile's memory buffer can be released too. The takeaway is that instead of always having three memory buffers just to store the data structures, you'll only have two at any time (ignoring, of course, the other memory actually used to construct the file). Not to mention that when working with arrays (and JSON files are exactly that), dynamically allocated memory grows dramatically, not linearly as we might tend to think. The bottom line: rather than taking a 50% hit on storage allocation alone (three big buffers taking 50% more space than just two of them), it is better to handle the functions that 'touch' these huge files in smaller steps.
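In the same spirit, once the decode has finished, the intermediate string can be dropped explicitly so PHP can reclaim its buffer (a small sketch continuing the snippet above):

$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));
$data = json_decode($somefile, true);
unset($somefile); // the raw JSON string is no longer referenced, so its memory can be freed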

To understand this, imagine that you access only what is needed at a certain moment in time (this is also a principle called YAGNI - You Aren't Gonna Need It - from the Extreme Programming practices; see the reference here: https://wiki.c2.com/?YouArentGonnaNeedIt), something inherited since the old C or COBOL times.

The next approach to follow is to break the file into more pieces, but structured ones (a relationally dependent data structure), as in a database table or tables.

Obviously, you have to save the data pieces again as blobs in the database. The advantage is that retrieving data from a DB is much faster than from a file (due to the indexes allocated by the SQL engine when generating and updating the tables). A table with one or two indexes can be accessed lightning fast by a structured query. Again, the indexes are pointers to the main storage of the data.

One important point, however: if you still want to keep the data as JSON (content and storage type, instead of tables in a DB), you cannot update it locally without changing it globally. I am not sure what you meant by reading the time-related values in the JSON file. Do you mean that your JSON file changes continuously? Better to break it into several tables so each separate one can change without affecting the whole mega-structure of the data. Easier to manage, easier to maintain, easier to locate changes.

My understanding is that the best solution would be to split the same file into several JSON files where you strip out the values you don't need. BY THE WAY, DO YOU ACTUALLY NEED ALL THE STORED DATA?

I wouldn't come now with code unless you explain the above issues (so we can have a conversation), and then I will edit my answer accordingly. I wrote a question yesterday related to handling blobs and storing them on the server, in order to accelerate the execution of a data update on a server using a cron process. My data was about 25 MB+, not 500+ as in your case, but I must understand the use case for your situation.

One more thing: how was the file that you must process created? Why do you only handle its final form instead of intervening while it is being fed with data? My opinion is that you might stop storing data into it as you did previously (and thus stop adding to your pain), transform its purpose into historic data storage only from now on, and store the future data in something more elastic (like MongoDB or other NoSQL databases).

Probably you don't need code so much as a solid and useful strategy and way of working with your data first.

Programming comes last, after you decided all the detailed architecture of your web project.

Upvotes: 5

Aabir Hussain

Reputation: 1181

My approach would be to read the JSON file in chunks.

If these JSON objects have a consistent structure, you can easily detect where a JSON object in the file starts and ends.

Once you collect a whole object, you insert it into a db, then go on to the next one.

There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.

Above answer is taken from => Parse large JSON file

Please see the reference below; it can be helpful for your case:

=> https://laracasts.com/discuss/channels/general-discussion/how-to-open-a-28-gb-json-file-in-php
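A rough sketch of that chunked approach, assuming the file is a sequence of flat, self-contained JSON objects: the brace counting below is naive (it ignores braces inside string values), and the file path, record keys, and saveInDataBase() helper are hypothetical placeholders borrowed from the question.

$handle = fopen(storage_path("test/ts/ts_records.json"), "r"); // hypothetical file, one object per record
$buffer = "";
$depth  = 0;

while (($chunk = fread($handle, 8192)) !== false && $chunk !== "") {
    foreach (str_split($chunk) as $char) {
        if ($char === "{") {
            $depth++;
        }
        if ($depth > 0) {
            $buffer .= $char;
        }
        if ($char === "}") {
            $depth--;
            if ($depth === 0) {
                // One complete object collected: decode it alone and store it.
                $record = json_decode($buffer, true);
                saveInDataBase($record["time"], $record["value"]); // hypothetical per-record keys
                $buffer = "";
            }
        }
    }
}
fclose($handle);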

Upvotes: 1

Maxi

Reputation: 433

As you correctly say, you won't get around reading it line by line. Using SQL as suggested just moves the problem to another environment. I would personally do it this way:

  1. When a new JSON file comes in, put it in storage; the easiest would be S3 with Storage::disk('s3')->put(...); (https://laravel.com/docs/7.x/filesystem). Then put it in a queue. You could use a Laravel queue or, what I prefer, RabbitMQ. Add a new entry to the queue, like {'job': 'parseMyJSON', 'path': 'https://path-on.s3'}
  2. Create a new server instance that can access the queue
  3. Write a worker instance of your app that can take a job from the queue, and run it on the new server from step 2. Whenever you put a job into the queue, the worker will get the JSON file from S3 and do the necessary work, then take the next job from the queue, one by one.

Whether this worker instance is written in Python or PHP, you have to test which will work faster. The advantage of this is that you can scale the workers to however many you need, and it won't affect the performance of your web app. I hope this helps you.
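A rough, Laravel-flavored sketch of steps 1 and 3: the job class name, S3 path, and saveInDataBase() helper are placeholders, and RabbitMQ would need its own queue driver configuration.

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\Storage;

class ParseMyJson implements ShouldQueue
{
    use Dispatchable, Queueable;

    protected $path;

    public function __construct(string $path)
    {
        $this->path = $path;
    }

    public function handle(): void
    {
        // The queue worker, not the web request, pays the memory/CPU cost here.
        $data = json_decode(Storage::disk('s3')->get($this->path), true);

        foreach ($data['data']['time'] as $i => $t) {
            saveInDataBase($t, $data['data']['values'][$i]);
        }
    }
}

// When a new file arrives: store it, then queue the parsing job.
Storage::disk('s3')->put('incoming/ts_big_data.json', $uploadedContents); // $uploadedContents is a placeholder
ParseMyJson::dispatch('incoming/ts_big_data.json');

// On the worker server: php artisan queue:work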

Upvotes: 0

user8555937

Reputation: 2377

My question is: how can I get any piece of this data and navigate between its nodes in the most efficient way, one that won't take much RAM and CPU? I can't read the file line by line, because I need to be able to get any piece of the data when I have to.

It's plain-text JSON and you have no indexes, so it's impossible to query your data without iterating over all of it. The solution is to serialize your data once and for all and store it in a database (I'm thinking SQLite for fast setup).
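A minimal sketch of that one-time serialization with SQLite via PDO: the table name, column names, database path, and the already-decoded $data array are assumptions for illustration; the point is that later reads hit an indexed table instead of re-parsing the JSON.

// One-time import: decode (or stream) the JSON once and write it into SQLite.
$pdo = new PDO("sqlite:" . storage_path("test/ts/readings.sqlite"));
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE IF NOT EXISTS readings (t INTEGER PRIMARY KEY, value REAL)");

$insert = $pdo->prepare("INSERT INTO readings (t, value) VALUES (?, ?)");

$pdo->beginTransaction();                        // one transaction keeps the bulk insert fast
foreach ($data["data"]["time"] as $i => $t) {
    $insert->execute([$t, $data["data"]["values"][$i]]);
}
$pdo->commit();

// Afterwards, any slice is an indexed lookup instead of a full JSON parse.
$slice = $pdo->query("SELECT t, value FROM readings WHERE t BETWEEN 100 AND 200")->fetchAll();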

If you absolutely can't store your data in a database, or can't convert it to SQLite, you have no other choice but to create a queue job which will parse it over time.

Upvotes: 0

Nicholas Summers

Reputation: 4756

Try Reducing Your Bulk Data Complexity for Faster File I/O

JSON is a great format to store data in, but it comes at the cost of needing to read the entire file to parse it.

Making your data structure simpler but spread out across several files allows you to read each file line by line, which is much faster than reading everything at once. This also comes with the benefit of not needing to hold the entire file in RAM, so it is friendlier to resource-limited environments.

That might look something like this:

objects.json

{
  "data": {
    "times_file": "/some/path/objects/object-123/object-123-times.csv",
    "values_file": "/some/path/objects/object-123/object-123-times.csv"
  }
}

object-123-times.csv

1
2
3
4
...

This would let you store your bulk data in a simpler, easier-to-access format. You could then use something like fgetcsv() to parse each line.
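A minimal sketch of reading those per-object CSV files with fgetcsv(), using the paths from the objects.json example above and the saveInDataBase() helper from the question; only one line of each file is in memory at a time.

$meta   = json_decode(file_get_contents("objects.json"), true);
$times  = fopen($meta["data"]["times_file"], "r");
$values = fopen($meta["data"]["values_file"], "r");

// Read both files in lockstep, one row at a time.
while (($t = fgetcsv($times)) !== false && ($v = fgetcsv($values)) !== false) {
    saveInDataBase($t[0], $v[0]);
}

fclose($times);
fclose($values);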

Upvotes: -1

Foued MOUSSI

Reputation: 4813

You may split your arrays into chunks using the array_chunk() function.

array_chunk() is a built-in PHP function used to split an array into chunks of a given size, depending on the parameters passed to it. The last chunk may contain fewer elements than the requested size.

Check the examples in the PHP documentation for array_chunk().
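A small sketch of the idea, assuming $data has already been decoded as in the question; note that chunking only batches the database work, since the file still has to be decoded first.

// Split both arrays into matching chunks of 1000 entries each.
$timeChunks  = array_chunk($data["data"]["time"], 1000);
$valueChunks = array_chunk($data["data"]["values"], 1000);

foreach ($timeChunks as $c => $times) {
    foreach ($times as $i => $t) {
        saveInDataBase($t, $valueChunks[$c][$i]); // helper from the question
    }
    // e.g. commit a transaction or flush buffers between chunks here
}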

Upvotes: -2
