ASH

Reputation: 20362

Trying to read JSON from URL and parse into CSV format

I am trying to loop through four URLs in a list, scrape the contents of each URL, and save each as a separate CSV. I think my code below is close, but it doesn't really seem to parse the JSON strings into a human-readable format. Also, the headers are missing.

Here is my hacked-together code.

import urllib
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']
for item in all_links:
    #print(item)
    try:
        length = len(item)
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        print(r.json)
        df = pd.DataFrame(r)
        df.head()
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        #urllib.request.urlretrieve(df,DOWNLOAD_PATH)
        r = requests.get(item)
        with open(DOWNLOAD_PATH,'wb') as f:
            f.write(r.content)
    except Exception as e: print(e)
   

This is what the data looks like. Is this right? I thought the data would look a lot cleaner if it were converted from JSON to CSV.

[screenshot of the saved file showing raw, unparsed JSON]

Upvotes: 0

Views: 825

Answers (2)

Adid

Reputation: 1584

You are close; here's what you need to change:

  1. You can read the JSON into a pandas DataFrame with df = pd.read_json(text, lines=True). Make sure to specify lines=True because some of your data contains \n characters (see the short sketch after this list).
  2. You can write the same DataFrame out to a CSV with df.to_csv(file).
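
For point 1, here is a minimal sketch of what lines=True does, using a made-up two-record sample rather than the actual hospital data: each line of the input is parsed as its own JSON object, which is why it copes with the \n characters.

import io
import pandas as pd

# Hypothetical two-record sample in JSON Lines form (one object per line)
sample = '{"code": "A1", "charge": 100}\n{"code": "B2", "charge": 250}'

# lines=True tells pandas to parse each line as a separate JSON object
df = pd.read_json(io.StringIO(sample), lines=True)
print(df)
#   code  charge
# 0   A1     100
# 1   B2     250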

All in all, there are a few things in your code that could be removed; for example, you're calling requests.get twice on the same URL for no real reason, which roughly doubles the time each iteration spends downloading.

import requests
import pandas as pd

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
  'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']
for item in all_links:
    try:
        # Build the output file name from the URL
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        df = pd.read_json(r.text, lines=True)
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        # Open in text mode; to_csv writes strings, not bytes
        with open(DOWNLOAD_PATH, 'w', newline='') as f:
            df.to_csv(f)
    except Exception as e:
        print(e)

Upvotes: 1

a_n

Reputation: 186

I am not sure if you are asking for an opinion. Is CSV necessary, or are you just trying to save the data in a readable (and preferably machine-parseable) format?

Anyway, here is how to export one result to a .csv file:

# Required imports
import json
import csv

"""I am assuming you already made the request and got the response"""
data_set = r.json()   # An iterable of python dict objects
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
# (Please take enough care to not reveal personal info above)

# getting the headers
headers = data_set[0].keys()

with open(DOWNLOAD_PATH, "wt", newline="") as file:
    writer = csv.writer(file)
    
    writer.writerow(headers)    # writing the headers

    for data in data_set:    # writing the values
        writer.writerow(data.values())

And here is the way to export one result to a .json file:

import json

data_set = r.json()
DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.json'

with open(DOWNLOAD_PATH, "wt") as file:
    json.dump(data_set, file)

You can use whichever format is more readable to you. Note that the methods above handle a single request's result; you will have to repeat them for each result you pull from the website (a rough loop sketch is shown below).
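
For completeness, here is a rough sketch of that loop, reusing the all_links list and file-name extraction from the question, and assuming each response parses to a list of flat dicts (that part is an assumption, not something verified against the actual files):

import csv
import requests

for item in all_links:
    try:
        # Same file-name extraction as in the question
        file_name = item[item.find('_') + 1:item.rfind('?') - 21]
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'

        data_set = requests.get(item).json()   # assumed: a list of flat dicts
        with open(DOWNLOAD_PATH, "wt", newline="") as file:
            writer = csv.writer(file)
            writer.writerow(data_set[0].keys())    # headers
            for data in data_set:
                writer.writerow(data.values())     # values
    except Exception as e:
        print(e)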

Credits for the CSV method: GeeksforGeeks

Upvotes: 0
