Bobipuegi

Reputation: 563

Efficient looping through large JSON-Files

I wrote a short Python script to extract some population data from an API and store it in a CSV file. An example of what the API returns can be found here. The "data" array contains more than 8000 observations, so I am looking for an efficient way to access it. The code I wrote works, but it takes hours to run. Hence my question: is there any way to loop through this JSON more efficiently, or to extract the needed data without looping through every observation?

import requests
api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"

with open("population.csv", "w") as outfile:
    outfile.write("country,year,group,fullname,count\n")
    for i in range(32, 51):
        response = requests.get(api_base + str(i))
        print(api_base + str(i))
        for observation in response.json()['data']:
            count = observation["value"]["numeric"]
            country = observation["dimensions"]["COUNTRY"]
            year = observation["dimensions"]["YEAR"]
            group = observation["dimensions"]["AGE_GRP_6"]
            fullGroupName = response.json()['full_name']
            if observation["dimensions"]["SEX"] == "ALL":
                outfile.write("{},{},{},{},{}\n".format(country, year, group, fullGroupName, count))

Thank you in advance for your answers.

Upvotes: 2

Views: 2003

Answers (3)

Wojciech Walczak

Reputation: 3599

Although Stefan Pochmann has already answered your question, I think it's worth mentioning how you could have figured out the problem yourself.

One way would be to use a profiler, for example Python's cProfile, which is included in the standard library.

Assuming that your script is called slow_download.py, you can limit the range in your loop to, for example, range(32, 33) and execute it in the following way:

python3 -m cProfile -s cumtime slow_download.py

The -s cumtime sorts the calls by cumulative time.

The result would be:

   http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_832
          222056 function calls (219492 primitive calls) in 395.444 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    122/1    0.005    0.000  395.444  395.444 {built-in method builtins.exec}
        1   49.771   49.771  395.444  395.444 py2.py:1(<module>)
     9010    0.111    0.000  343.904    0.038 models.py:782(json)
     9010    0.078    0.000  332.900    0.037 __init__.py:271(loads)
     9010    0.091    0.000  332.801    0.037 decoder.py:334(decode)
     9010  332.607    0.037  332.607    0.037 decoder.py:345(raw_decode)
     ...

This clearly suggests that the problem lies with json() and the methods it calls: loads() and raw_decode().
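
If you would rather stay inside Python, you can do the same measurement programmatically with cProfile and pstats. The sketch below is only an illustration; the fetch_one_dataset() helper is a made-up wrapper around one request-and-parse step of the original script:

    import cProfile
    import pstats

    import requests

    def fetch_one_dataset(i):
        # Hypothetical helper: one request plus JSON parsing, mirroring one
        # iteration of the original loop.
        url = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8" + str(i)
        return requests.get(url).json()

    profiler = cProfile.Profile()
    profiler.enable()
    fetch_one_dataset(32)
    profiler.disable()

    # Print the ten most expensive calls, sorted by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

The output is the same kind of table as above, so the json(), loads() and raw_decode() lines should stand out in the same way.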

Upvotes: 1

Stefan Pochmann

Reputation: 28596

Well, don't call response.json() over and over and over again unnecessarily.

Instead of

  for observation in response.json()['data']:
      fullGroupName = response.json()['full_name']

do

  data = response.json()
  for observation in data['data']:
      fullGroupName = data['full_name']

After this change, the whole thing takes my PC about 33 seconds, and pretty much all of that time is spent on the requests. Maybe you could speed that up further by using parallel requests, if that's OK with the site.
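
For illustration only, a minimal sketch of such parallel requests with concurrent.futures from the standard library; the worker count of 5 is an arbitrary choice, and whether parallel requests are acceptable is up to the site:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"

    def fetch(i):
        # Download and parse one data set.
        return requests.get(api_base + str(i)).json()

    with ThreadPoolExecutor(max_workers=5) as executor, \
         open("population.csv", "w") as outfile:
        outfile.write("country,year,group,fullname,count\n")
        # executor.map yields results in the same order as range(32, 51).
        for data in executor.map(fetch, range(32, 51)):
            full_group_name = data["full_name"]
            for observation in data["data"]:
                if observation["dimensions"]["SEX"] == "ALL":
                    outfile.write("{},{},{},{},{}\n".format(
                        observation["dimensions"]["COUNTRY"],
                        observation["dimensions"]["YEAR"],
                        observation["dimensions"]["AGE_GRP_6"],
                        full_group_name,
                        observation["value"]["numeric"]))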

Upvotes: 2

hspandher

Reputation: 16733

If the data is really large, dump it into MongoDB and query whatever you need efficiently.
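
For example, a rough sketch with pymongo, assuming a MongoDB instance running on localhost; the who/population database and collection names are made up:

    import requests
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    collection = client["who"]["population"]

    api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"
    for i in range(32, 51):
        data = requests.get(api_base + str(i)).json()
        # Store every observation of this data set as its own document.
        collection.insert_many(data["data"])

    # Later, query only what you need, e.g. the all-sexes observations.
    for doc in collection.find({"dimensions.SEX": "ALL"}):
        print(doc["dimensions"]["COUNTRY"], doc["value"]["numeric"])

The filter on dimensions.SEX happens in the database, so you avoid re-parsing the JSON every time you need a different slice of the data.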

Upvotes: 0
