Gabriele Da Re

Reputation: 47

How to format a scraper output

I'm trying to extract the prices from a site so that I can build a scraper, and I wrote the program below. To get the HTML I used BeautifulSoup with the default html.parser. I then narrowed the result down with a variable called generale, set to soup.findAll("span"). Now I need to filter that result further (it is a list, I suppose) to get to the prices, and this is where I got stuck. Any suggestions? I don't know how to approach the problem.

import smtplib
import time

import requests
from bs4 import BeautifulSoup as bs

URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"}

def Check_page1():
    # download the page and parse it with the built-in HTML parser
    page = requests.get(URL, headers=headers)
    soup = bs(page.content, 'html.parser')

    # every <span> element on the page
    generale = soup.findAll('span')

    price = None  # ??? this is where I'm stuck: how do I get the prices out of generale?
    print(price)
    print(generale)

Check_page1()

Upvotes: 1

Views: 138

Answers (2)

Roland Smith

Reputation: 43533

There doesn't seem to be a <span class="price">. Here's what I did.

In [1]: import requests 
   ...:  
   ...: URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/" 
   ...:  
   ...: headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"} 
Out[1]: {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

In [2]: page = requests.get(URL, headers=headers)                                                        
Out[2]: <Response [200]>

In [3]: import re                                                                                        

In [4]: re.findall(r'<span.*?</span>', page.text)

There are a lot of spans. To me, the following looked most like prices.

 '<span class="topclick-list-element-price">10.56&euro;</span>',
 '<span class="topclick-list-element-price">2.79&euro;</span>',
 '<span class="topclick-list-element-price">2.90&euro;</span>',
 '<span class="topclick-list-element-price">27.86&euro;</span>',
 '<span class="topclick-list-element-price">11.15&euro;</span>',
 '<span class="topclick-list-element-price">11.46&euro;</span>'

So I refined the regular expression

In [7]: prices = [float(p) for p in re.findall(r'<span class="topclick-list-element-price">(.*)&euro;</span>', page.text)]

In [8]: print(prices)                                                                                    
[10.56, 2.79, 2.9, 27.86, 11.15, 11.46, 11.2, 18.67, 9.69, 24.25,
20.25, 19.59, 44.21, 28.3, 31.92, 41.39, 4.76, 24.57, 8.75, 28.62, 
27.14, 8.52, 31.95, 24.59, 27.93, 27.86, 5.5, 24.99, 37.99, 14.27, 
36.0, 8.75, 35.99, 37.34, 23.4, 22.98, 31.95, 36.89, 25.57, 27.9, 
35.88, 41.39, 33.22, 42.29, 31.29, 42.29, 38.09, 33.89, 33.59, 28.83,
10.56, 2.79, 2.9, 27.86, 11.15, 11.46, 11.2, 18.67, 9.69, 24.25, 
20.25, 19.59, 44.21, 28.3, 31.92, 41.39, 4.76, 24.57, 8.75, 28.62, 
27.14, 8.52, 31.95, 24.59, 27.93, 27.86, 5.5, 24.99, 37.99, 14.27, 
36.0, 8.75, 35.99, 37.34, 23.4, 22.98, 31.95, 36.89, 25.57, 27.9, 
35.88, 41.39, 33.22, 42.29, 31.29, 42.29, 38.09, 33.89, 33.59, 28.83, 
24.25, 12.11, 28.84, 37.36, 23.71, 2.19, 2.99, 34.25, 11.38, 14.99, 
20.67, 4.99, 25.56, 1.81, 12.99, 19.73, 9.99, 9.99, 0.92, 11.99, 
27.93, 22.94, 8.46, 32.78, 40.03, 11.19, 12.45, 13.29, 13.9, 26.22, 
26.22, 23.34, 25.22, 32.78, 37.36, 21.5, 19.01, 26.53, 24.91, 17.96, 
35.4, 17.05, 21.56, 16.39, 35.4, 8.98, 65.54, 13.45, 15.73, 22.39, 
17.99, 40.17, 8.0, 11.34, 14.99, 17.99, 10.99, 24.99, 22.41, 17.99, 
40.17, 7.2, 49.99, 41.1, 39.85, 16.99, 19.99, 21.99, 10.99, 19.73, 
14.99, 22.39, 6.55, 32.98, 27.99, 29.89, 19.99, 29.99, 37.36, 19.99, 
35.49, 15.99, 21.99, 46.71, 15.72, 42.97, 18.68, 18.87, 15.72, 19.99,
 29.99, 9.99, 28.02, 35.99, 39.99, 15.72, 15.72, 9.33, 44.48, 47.99, 
43.99, 47.99, 38.8, 23.27, 20.69, 44.6, 41.97, 15.75, 44.49, 19.87, 
51.99, 36.89, 15.99, 39.99, 27.99, 11.58, 43.99, 41.1, 19.99, 43.64, 
19.99, 36.89, 25.69]
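
A note on the pattern above: `.` does not match newlines by default, so the greedy `(.*)` only captures a single price when each span sits on its own line. The following is a minimal sketch of a narrower pattern that also copes with several spans on one line. `SAMPLE_HTML` is a made-up stand-in for `page.text`, and `extract_prices` is a hypothetical helper name, not something from the page or from any library.

```python
import re

# Hypothetical sample standing in for page.text; the real page is much larger.
SAMPLE_HTML = (
    '<span class="topclick-list-element-price">10.56&euro;</span>'
    '<span class="other">ignored</span>'
    '<span class="topclick-list-element-price">2.79&euro;</span>'
    '<span class="topclick-list-element-price">2.90&euro;</span>'
)

# Capture only digits and dots, so a match can never run past one span.
PRICE_RE = re.compile(
    r'<span class="topclick-list-element-price">\s*([\d.]+)\s*&euro;</span>'
)


def extract_prices(html_text):
    """Return every matched price as a float, in page order."""
    return [float(value) for value in PRICE_RE.findall(html_text)]


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

With the real response you would call `extract_prices(page.text)` instead. The sketch assumes the class name and the `&euro;` entity stay exactly as shown in the spans listed earlier.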

Upvotes: 0

Jan Lipovský

Reputation: 371

If you look at the source code of the page, you can see that the prices are in <span> elements with the class name price. They can be parsed this way:

import time

import requests
from bs4 import BeautifulSoup as bs

URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"}

def CheckPage1():
    page = requests.get(URL, headers=headers)
    soup = bs(page.content, 'html.parser')

    # all spans with prices
    span_prices = soup.findAll("span", {"class": "price"})

    # to get all prices you need to extract text or content attribute
    for span in span_prices:
        price = span.text
        # remove whitespace and print price
        print(price.strip())

        # to get prices without money sign uncomment one of those lines
        # print(price.strip()[:-1])
        # print(price.strip().strip('€'))

CheckPage1()
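
If you want numbers rather than strings (for example to find the cheapest offer), the stripped text can be converted after removing the money sign. This is only a sketch: `parse_price` and its `symbol` parameter are hypothetical names, the sample strings are made up, and it assumes a dot as the decimal separator with the currency symbol at the end of the text.

```python
from decimal import Decimal


def parse_price(text, symbol="€"):
    """Turn a scraped string such as ' 10.56€ ' into a Decimal.

    Hypothetical helper: assumes a dot as the decimal separator and the
    currency symbol, if present, at the end of the string.
    """
    # strip outer whitespace, then the trailing symbol, then any space
    # that was sitting between the number and the symbol
    cleaned = text.strip().rstrip(symbol).strip()
    return Decimal(cleaned)


# Made-up sample strings standing in for the span.text values.
scraped = [" 10.56€\n", "2.79€", "27.86 €"]
prices = [parse_price(s) for s in scraped]

# cheapest price in the sample
print(min(prices))
```

`Decimal` is used here instead of `float` so that prices compare and print exactly as they appear on the page; `float(cleaned)` would work the same way if exactness does not matter.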

Upvotes: 1
