Kev

Reputation: 361

Web scraper returns multiple errors

I'm writing a web scraper for an insurance website that saves the model, brand, sub-brand, and description to a CSV file. When I run my code, it sometimes works and other times gives me several errors ("list indices must be integers", "Expecting value: line 1 column 1", "JSON decoder is not working").

I've tried inserting print statements to see where the problem is, but I still can't figure it out.

import requests
import time
import json


session = requests.Session()
request_marcas = session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/brands-subbrands')
data = request_marcas.json()
fileCSV = open("webscraper_test.csv", "a")
fileCSV.write('Modelo' + ';' + 'ID_Marca' + ";" + 'ID_Submarca' + ";" + "ID_Tipo" + ";" + "Marca" +";"+ "Tipo"+ 'Descripcion' + "\n")

for i in range(2019, 2020):
        for marca in data['MARCA']:
            for submarca in marca['SUBMARCAS']:
                modelos = []
                modelos.append('https://www.citibanamexchubb.com/api/chubbnet/auto/models/' + marca['ID'] + '/' + submarca['ID'] + '/' + str(i))
                for link in modelos:
                    json_link = []
                    request_link = session.get(link).json()
                    json_link.append(request_link)
                    #print(request_link)
                    for desc_id in request_link['TIPO']:
                        #print(desc_id['ID'])
                        desc_detail = []
                        desc_detail.append(session.get('https://www.citibanamexchubb.com/api/chubbnet/auto/descriptions/' + desc_id['ID'] + '/2018').json())
                        #print(desc_detail)
                        try:
                            for desc in desc_detail['DESCRIPCION']:
                                print(desc['DESC'])
                        except Exception as e:
                            None

Upvotes: 1

Views: 63

Answers (1)

YellowShark

Reputation: 2269

So there are some odd inconsistencies in the auto/models endpoint that you're scraping. For instance, https://www.citibanamexchubb.com/api/chubbnet/auto/models/7/8/2019 returns this:

{
  "TIPO": {
    "ID": "381390223",
    "DESC": "MINI COOPER"
  }
}

While https://www.citibanamexchubb.com/api/chubbnet/auto/models/1/1/2019 returns this:

{
  "TIPO": [
    {
      "ID": "364026215",
      "DESC": "MDX"
    },
    {
      "ID": "364026216",
      "DESC": "RDX"
    },
    {
      "ID": "364031544",
      "DESC": "ILX"
    },
    {
      "ID": "364031613",
      "DESC": "TLX"
    },
    {
      "ID": "364031674",
      "DESC": "NSX"
    }
  ]
}

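So in the first one, "TIPO" is a dict, while in the second one, "TIPO" is a list. Iterating over a dict yields its keys, which are strings, so desc_id['ID'] in your loop raises a TypeError whenever "TIPO" comes back as a dict. I've made a modification to your script that gets it running without throwing any errors. I'm sure it's not quite what you're looking for, but it at least handles that difference between the two types: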

import requests

BASE = 'https://www.citibanamexchubb.com/api/chubbnet/auto'

session = requests.Session()
request_marcas = session.get(BASE + '/brands-subbrands')
data = request_marcas.json()

# The with-block closes the file once the loops finish.
with open("webscraper_test.csv", "a") as fileCSV:
    # Header row (the original was missing the ';' between Tipo and Descripcion).
    fileCSV.write('Modelo;ID_Marca;ID_Submarca;ID_Tipo;Marca;Tipo;Descripcion\n')

    for i in range(2019, 2020):
        for marca in data['MARCA']:
            for submarca in marca['SUBMARCAS']:
                link = BASE + '/models/' + marca['ID'] + '/' + submarca['ID'] + '/' + str(i)
                request_link = session.get(link).json()
                #print(request_link)

                # here's where I've made some changes:
                # "TIPO" is a dict for a single model and a list for several.
                desc_detail = []
                if isinstance(request_link['TIPO'], dict):
                    desc_detail.append(session.get(
                        BASE + '/descriptions/' + request_link['TIPO']['ID'] + '/2018').json())
                    print(request_link['TIPO']['DESC'])
                elif isinstance(request_link['TIPO'], list):
                    for item in request_link['TIPO']:
                        desc_detail.append(session.get(
                            BASE + '/descriptions/' + item['ID'] + '/2018').json())
                        print(item['DESC'])
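
If you'd rather not repeat the isinstance branches everywhere, one option is to normalize the value to a list up front. Below is a minimal sketch that runs offline against the two response shapes shown above. The helper names (as_list, parse_json_or_none, tipo_descs) are made up for illustration and are not part of requests or of the API. It also guards against a body that isn't JSON at all, since an empty body is the usual cause of "Expecting value: line 1 column 1"; I'm only assuming that's what the endpoint occasionally sends back.

```python
import json


def as_list(value):
    """Lists pass through, None becomes [], anything else is wrapped in a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def parse_json_or_none(text):
    """Parse a response body, returning None instead of raising on non-JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def tipo_descs(body):
    """Return every DESC under "TIPO", whether "TIPO" is a dict or a list."""
    payload = parse_json_or_none(body)
    if not isinstance(payload, dict):
        return []
    return [item['DESC'] for item in as_list(payload.get('TIPO'))]


# Sample bodies modeled on the two responses above, plus an empty one.
single = '{"TIPO": {"ID": "381390223", "DESC": "MINI COOPER"}}'
several = ('{"TIPO": [{"ID": "364026215", "DESC": "MDX"},'
           ' {"ID": "364026216", "DESC": "RDX"}]}')
empty = ''

for body in (single, several, empty):
    print(tipo_descs(body))
```

In the real script you would pass response.text from session.get(...) to tipo_descs instead of calling .json() directly, so a bad response gets skipped rather than crashing the whole run.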

Hope that helps!

Upvotes: 2
