Reputation: 563
I wrote a short Python script to extract some population data from an API and store it in a CSV file. An example of what the API returns can be found here. The "data" field contains more than 8000 observations, so I am looking for an efficient way to access it. The code I wrote works, but takes hours to run. Hence my question: is there any way to loop through this JSON more efficiently, or to extract the needed data without looping through every observation?
import requests

api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"

with open("population.csv", "w") as outfile:
    outfile.write("country,year,group,fullname,count\n")
    for i in range(32, 51):
        response = requests.get(api_base + str(i))
        print(api_base + str(i))
        for observation in response.json()['data']:
            count = observation["value"]["numeric"]
            country = observation["dimensions"]["COUNTRY"]
            year = observation["dimensions"]["YEAR"]
            group = observation["dimensions"]["AGE_GRP_6"]
            fullGroupName = response.json()['full_name']
            if observation["dimensions"]["SEX"] == "ALL":
                outfile.write("{},{},{},{},{}\n".format(country, year, group, fullGroupName, count))
Thank you in advance for your answers.
Upvotes: 2
Views: 2003
Reputation: 3599
Although Stefan Pochmann has already answered your question, I think it's worth mentioning how you could have figured out the problem yourself.
One way would be to use a profiler, for example Python's cProfile, which is included in the standard library.
Assuming that your script is called slow_download.py, you can limit the range in your loop to, for example, range(32, 33) and execute it in the following way:
python3 -m cProfile -s cumtime slow_download.py
The -s cumtime option sorts the calls by cumulative time.
The result would be:
http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_832
222056 function calls (219492 primitive calls) in 395.444 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
122/1 0.005 0.000 395.444 395.444 {built-in method builtins.exec}
1 49.771 49.771 395.444 395.444 py2.py:1(<module>)
9010 0.111 0.000 343.904 0.038 models.py:782(json)
9010 0.078 0.000 332.900 0.037 __init__.py:271(loads)
9010 0.091 0.000 332.801 0.037 decoder.py:334(decode)
9010 332.607 0.037 332.607 0.037 decoder.py:345(raw_decode)
...
This clearly suggests that the problem is related to json() and related methods: loads() and raw_decode().
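If you would rather keep the profiling inside the script instead of running it from the command line, a minimal sketch with cProfile.Profile and pstats could look like this (main() is a hypothetical wrapper around the download/parse loop from slow_download.py):
import cProfile
import pstats

def main():
    ...  # the download/parse loop from slow_download.py goes here

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Same ordering as -s cumtime on the command line, limited to the top 10 rows.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)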
Upvotes: 1
Reputation: 28596
Well don't call response.json() over and over and over again unnecessarily.
Instead of
for observation in response.json()['data']:
    fullGroupName = response.json()['full_name']
do
data = response.json()
for observation in data['data']:
    fullGroupName = data['full_name']
After this change the whole thing takes my PC about 33 seconds. And pretty much all of that is for the requests. Maybe you could speed that up further by using parallel requests if that's ok for the site.
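If parallel requests are acceptable, a minimal sketch of that idea with the standard library's concurrent.futures could look like this (the worker count is a guess, and whether the site tolerates concurrent requests is an assumption):
import concurrent.futures
import requests

api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"

def fetch(i):
    # One request per data set; json() is called exactly once per response.
    return requests.get(api_base + str(i)).json()

# A small pool of threads is usually enough and keeps the load on the server modest.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, range(32, 51)))

for data in results:
    fullGroupName = data['full_name']
    for observation in data['data']:
        ...  # same filtering and CSV writing as before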
Upvotes: 2
Reputation: 16733
If the data is really large, dump it into MongoDB and query whatever you need efficiently.
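A minimal sketch of that approach with pymongo, assuming a local MongoDB instance and the same API layout as in the question (the database and collection names here are made up):
import pymongo
import requests

api_base = "http://dw.euro.who.int/api/v3/data_sets/HFAMDB/HFAMDB_8"

client = pymongo.MongoClient("mongodb://localhost:27017/")
collection = client["who_hfamdb"]["observations"]  # hypothetical names

for i in range(32, 51):
    data = requests.get(api_base + str(i)).json()
    # insert_many is much faster than inserting documents one at a time
    collection.insert_many(data["data"])

# Then query only what you need, e.g. the observations for both sexes combined:
for doc in collection.find({"dimensions.SEX": "ALL"}):
    print(doc["dimensions"]["COUNTRY"], doc["value"]["numeric"])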
Upvotes: 0