Reputation: 20362
I am trying to loop through four URLs in a list, scrape the contents of each URL, and save each as a separate CSV. I think my code below is close, but it doesn't really seem to parse the JSON strings into a human-readable format. Also, the headers are missing.
Here is my hacked-together code.
import urllib
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']
for item in all_links:
    #print(item)
    try:
        length = len(item)
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        print(r.json)
        df = pd.DataFrame(r)
        df.head()
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        #urllib.request.urlretrieve(df,DOWNLOAD_PATH)
        r = requests.get(item)
        with open(DOWNLOAD_PATH,'wb') as f:
            f.write(r.content)
    except Exception as e: print(e)
This is what the data looks like. Is this right? I thought the data would look a lot cleaner if it was converted from JSON to CSV.
Upvotes: 0
Views: 825
Reputation: 1584
You are close, here's what you need to change:

df = pd.read_json(r.text, lines=True)

- make sure to specify lines=True here, because some of your data contains \n characters
- after parsing, write the frame out with df.to_csv(...)
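To illustrate what lines=True does (with two made-up records, not the real hospital data): each line of the input is parsed as its own JSON object and becomes one row of the frame.

import io
import pandas as pd

# two records in newline-delimited JSON (NDJSON) form
ndjson = '{"code": "A10", "description": "MRI scan"}\n{"code": "B20", "description": "CT scan"}'

df = pd.read_json(io.StringIO(ndjson), lines=True)
print(df)
#   code description
# 0  A10    MRI scan
# 1  B20     CT scan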
All in all, there are some things in your code that could be removed, e.g. you're calling requests.get
twice for no real reason, which slows your code down substantially.
import requests
import pandas as pd
all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']
for item in all_links:
    try:
        # slice out the hospital name, e.g. 'abrazomaranahospital'
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        df = pd.read_json(r.text, lines=True)
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        # open in text mode; newline='' avoids blank lines on Windows
        with open(DOWNLOAD_PATH, 'w', newline='') as f:
            df.to_csv(f)
    except Exception as e:
        print(e)
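As a side note, the index arithmetic for file_name is easy to break. One alternative sketch, assuming the URLs keep their current <ein>_<hospitalname>_standardcharges.json pattern, derives the name from the URL path instead:

from urllib.parse import urlparse
from pathlib import PurePosixPath

def file_name_from(url):
    # path stem, e.g. '474131755_abrazomaranahospital_standardcharges'
    stem = PurePosixPath(urlparse(url).path).stem
    # keep the middle piece, e.g. 'abrazomaranahospital'
    return stem.split('_')[1]

print(file_name_from(all_links[0]))  # abrazomaranahospital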
Upvotes: 1
Reputation: 186
I am not sure if you are asking for an opinion. Is CSV necessary? Or are you just trying to save it in a readable (and, preferably, programmatically extractable) format?
Anyways, here is the way to export one result to a .csv
file:
# Required imports
import csv

"""I am assuming you already made the request and got the response"""
data_set = r.json()  # a list of python dict objects
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
# (Please take enough care to not reveal personal info above)

# getting the headers from the first record
headers = data_set[0].keys()

with open(DOWNLOAD_PATH, "wt", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)  # writing the headers
    for data in data_set:  # writing the values
        writer.writerow(data.values())
And here is the way to export one result to a .json
file:
import json

data_set = r.json()
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.json'

with open(DOWNLOAD_PATH, "wt") as file:
    json.dump(data_set, file, indent=2)  # indent makes the file human-readable
You can use whichever file is more readable to you. Note that the above methods were for a single request result. You will have to do the above for each result you extract from the website.
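For example, wrapping the CSV version in the question's loop might look like this (a sketch, assuming each response parses to a list of flat dicts as above):

import csv
import requests

for item in all_links:
    file_name = item[item.find('_') + 1:item.rfind('?') - 21]
    data_set = requests.get(item).json()
    path = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
    with open(path, "wt", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(data_set[0].keys())  # headers from the first record
        for data in data_set:
            writer.writerow(data.values())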
Credits for the CSV method: GeeksforGeeks
Upvotes: 0