Baby Yoda

Reputation: 73

I am getting a KeyError on a JSON that I scraped

I have scraped some JSON from a website. When I try to iterate through it, I get a KeyError, but I'm unsure why; the loop stays within the length of the JSON. Any ideas as to what is going on?

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
url = "https://employment.ucsd.edu/jobs?page_size=250&page_number=1&keyword=clinical%20lab%20scientist&location_city" \
      "=Remote&location_city=San%20Diego&location_city=Encinitas&location_city=Murrieta&location_city=La%20Jolla" \
      "&location_city=Not%20Specified&location_city=Vista&sort_by=score&sort_order=DESC "
request = requests.get(url, headers=headers)
response = BeautifulSoup(request.text, "html.parser")
all_data = response.find_all("script", {"type": "application/ld+json"})
df = pd.DataFrame(columns=("Title", "Department", "Salary Range", "Appointment Percent", "URL"))

for data in all_data:
    jsn = json.loads(data.string)
    jsn_length = len(jsn['itemListElement'])
    # print(json.dumps(jsn, indent=4))
    n = 0
    while n < jsn_length:
        # print(jsn['itemListElement'][n])
        print(n)
        df['URL'] = jsn['itemListElement'][n]
        n += 1

Edit: traceback

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2022.1\plugins\python\helpers\pydev\pydevd.py", line 1491, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Will/PycharmProjects/UCSD_JOB_SCRAPE/main.py", line 19, in <module>
    jsn_length = len(jsn['itemListElement'])
KeyError: 'itemListElement'

Upvotes: 2

Views: 264

Answers (2)

Andrej Kesely

Reputation: 195528

To get a list of URLs into a DataFrame, you can use the following example:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}

url = (
    "https://employment.ucsd.edu/jobs?page_size=250&page_number=1&keyword=clinical%20lab%20scientist&location_city"
    "=Remote&location_city=San%20Diego&location_city=Encinitas&location_city=Murrieta&location_city=La%20Jolla"
    "&location_city=Not%20Specified&location_city=Vista&sort_by=score&sort_order=DESC "
)

request = requests.get(url, headers=headers)
soup = BeautifulSoup(request.content, "html.parser")

data = json.loads(soup.find("script", {"type": "application/ld+json"}).text)

urls = []
for e in data["itemListElement"]:
    urls.append(e["url"])

df = pd.DataFrame({"URL": urls})
print(df.head())

Prints:

                                                                                    URL
0      http://employment.ucsd.edu/clinical-lab-scientist-specialist-119559/job/20822209
1      http://employment.ucsd.edu/clinical-lab-scientist-specialist-120139/job/21460814
2      http://employment.ucsd.edu/clinical-lab-scientist-specialist-120483/job/21869984
3   http://employment.ucsd.edu/sr-clinical-lab-scientist-specialist-118105/job/20528292
4  http://employment.ucsd.edu/cls-clinical-lab-scientist-specialist-119095/job/20528293

Upvotes: 1

Mureinik

Reputation: 311823

Element number 250 in the JSON you referenced really doesn't seem to have an itemListElement key:

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "url": "https://health.ucsd.edu/",
  "logo": "https://dy5f5j6i37p1a.cloudfront.net/company/logos/157272/original/b228c5f9007911ecb905ed1c0f90d00e.png",
  "name": "UC San Diego "
}

The safest thing is probably to check for it explicitly, e.g.:

for data in all_data:
    jsn = json.loads(data.string)
    if jsn.get('itemListElement') is None:
        print('No itemListElement in the JSON. The JSON is\n' + data.string)
    else:
        jsn_length = len(jsn['itemListElement'])
        n = 0
        while n < jsn_length:
            # print(jsn['itemListElement'][n])
            print(n)
            df['URL'] = jsn['itemListElement'][n]
            n += 1
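Another option (my addition, not part of the original answer) is to key off the JSON-LD block's @type field rather than probing for itemListElement: per the object shown above, the job list carries "@type": "ItemList" while the trailing block is an "Organization". A minimal sketch, using two inlined sample blocks that stand in for the page's real <script> contents:

```python
import json

# Hypothetical stand-ins for the page's ld+json script contents:
# one ItemList block with jobs, one Organization block without them.
blocks = [
    '{"@type": "ItemList", "itemListElement": [{"url": "http://example.com/job/1"}]}',
    '{"@type": "Organization", "url": "https://health.ucsd.edu/"}',
]

urls = []
for raw in blocks:
    jsn = json.loads(raw)
    # Only ItemList blocks carry itemListElement; the Organization block is skipped.
    if jsn.get("@type") == "ItemList":
        urls.extend(e["url"] for e in jsn["itemListElement"])

print(urls)  # ['http://example.com/job/1']
```

This avoids the KeyError entirely because non-list blocks are filtered out before the lookup, at the cost of assuming the site keeps using the ItemList type for its job data.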

Upvotes: 2
