arthem

Reputation: 151

Having some issues with Python Exceptions in my script

I am trying to scrape data from a few websites for a proof-of-concept project, currently using Python 3 with BS4 (BeautifulSoup) to collect the data. I have a dictionary of URLs from three sites, and each site requires a different method to collect the data because its HTML is different. I have been using a try/except/else stack, but I keep running into issues. If you could have a look at my code and help me fix it, that would be great!

As I add more sites to be scraped, I will not be able to keep using try/except/else to cycle through methods until one finds the data. How can I future-proof this code so that I can add as many websites as I like and scrape data from whatever elements they contain?

# Scraping Script Here:

import re

import requests
from bs4 import BeautifulSoup
from lxml import etree


def job():

    prices = {

        # LIVEPRICES

        "LIVEAUOZ":           {"url": "https://www.gold.co.uk/",
                               "trader": "Gold.co.uk",
                               "metal":  "Gold",
                               "type":   "LiveAUOz"},

        # GOLD

        "GLDAU_BRITANNIA":    {"url": "https://www.gold.co.uk/gold-coins/gold-britannia-coins/britannia-one-ounce-gold-coin-2020/",
                               "trader": "Gold.co.uk",
                               "metal":  "Gold",
                               "type":   "Britannia"},
        "GLDAU_PHILHARMONIC": {"url": "https://www.gold.co.uk/gold-coins/austrian-gold-philharmoinc-coins/austrian-gold-philharmonic-coin/",
                               "trader": "Gold.co.uk",
                               "metal":  "Gold",
                               "type":   "Philharmonic"},
        "GLDAU_MAPLE":        {"url": "https://www.gold.co.uk/gold-coins/canadian-gold-maple-coins/canadian-gold-maple-coin/",
                               "trader": "Gold.co.uk",
                               "metal":  "Gold",
                               "type":   "Maple"},

        # SILVER

        "GLDAG_BRITANNIA":    {"url": "https://www.gold.co.uk/silver-coins/silver-britannia-coins/britannia-one-ounce-silver-coin-2020/",
                               "trader": "Gold.co.uk",
                               "metal":  "Silver",
                               "type":   "Britannia"},
        "GLDAG_PHILHARMONIC": {"url": "https://www.gold.co.uk/silver-coins/austrian-silver-philharmonic-coins/silver-philharmonic-2020/",
                               "trader": "Gold.co.uk",
                               "metal":  "Silver",
                               "type":   "Philharmonic"}

    }

    response = requests.get('https://www.gold.co.uk/silver-price/')
    soup = BeautifulSoup(response.text, 'html.parser')
    AG_GRAM_SPOT = soup.find(
        'span', {'name': 'current_price_field'}).get_text()

    # Convert to float
    AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
    # No need for another lookup (1 troy ounce = 31.1035 g)
    AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

    for coin in prices:
        response = requests.get(prices[coin]["url"])
        soup = BeautifulSoup(response.text, 'html.parser')

        try:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()    # <-- Method 1

        except:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()    # <-- Method 2

        else:
            text_price = soup.find(
                'td', {'class': 'gold-price-per-ounce'}).get_text()

        # Grab the number
        prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

    # ========================================================================

    root = etree.Element("root")

    for coin in prices:
        coinx = etree.Element("coin")
        etree.SubElement(coinx, "trader", {
                         'variable': coin}).text = prices[coin]["trader"]
        etree.SubElement(coinx, "metal").text = prices[coin]["metal"]
        etree.SubElement(coinx, "type").text = prices[coin]["type"]
        etree.SubElement(coinx, "price").text = "£" + str(prices[coin]["price"])
        root.append(coinx)

    fName = './templates/data.xml'
    with open(fName, 'wb') as f:
        f.write(etree.tostring(root, xml_declaration=True,
                               encoding="utf-8", pretty_print=True))

Upvotes: 0

Views: 61

Answers (1)

forgetso

Reputation: 2484

Add a config for the scraping, where each entry is something like this:

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": lambda x: float(re.sub(r"[^0-9\.]", "", x))
        }

    }
}

Use the selector part of price to get the relevant part of the HTML, then parse it with the parser function.

e.g.

for key, config in prices.items():
    response = requests.get(config['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    price_element = soup.select_one(config['price']['selector'])
    if price_element:
        AG_GRAM_SPOT = price_element.get_text()
        # convert to float
        AG_GRAM_SPOT = config['price']['parser'](AG_GRAM_SPOT)
        # etc

You can modify the config object as you need, but it will probably be very similar for most sites. For example, the text parsing could very well always be the same, so instead of a lambda function, create a named function with def.

def textParser(text):
    return float(re.sub(r"[^0-9\.]", "", text))

Then add the reference to textParser in the config.

prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": textParser
        }

    }
}

These steps will allow you to write generic code and save all those try/excepts.
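Putting the pieces together, here is a minimal runnable sketch of the config-driven approach. To keep it self-contained it parses an inline HTML snippet instead of fetching a live page (in real use you would feed it requests.get(url).text), and the selector and sample price are illustrative assumptions, not the real Gold.co.uk markup:

```python
import re

from bs4 import BeautifulSoup


def text_parser(text):
    # Strip currency symbols, commas, etc. and return a float
    return float(re.sub(r"[^0-9\.]", "", text))


# Stand-in for a fetched page; in real use this is requests.get(url).text
SAMPLE_HTML = '<table><tr><td id="total-price-inc-vat-1">£1,550.10</td></tr></table>'

# One entry per product; only the selector needs to change between sites
prices = {
    "GLDAU_BRITANNIA": {
        "html": SAMPLE_HTML,
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "Britannia",
        "price": {
            "selector": "td#total-price-inc-vat-1",  # illustrative CSS selector
            "parser": text_parser,
        },
    },
}


def scrape(prices):
    # Generic loop: the per-site differences live entirely in the config
    for key, config in prices.items():
        soup = BeautifulSoup(config["html"], "html.parser")
        element = soup.select_one(config["price"]["selector"])
        if element:
            config["value"] = config["price"]["parser"](element.get_text())
    return prices


result = scrape(prices)
print(result["GLDAU_BRITANNIA"]["value"])  # → 1550.1
```

Adding a fourth or fifth site is then a matter of adding a dictionary entry, not another try/except branch.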

Upvotes: 1
