Christian

Reputation: 23

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn Python by writing a small web-scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the listing page.

Here is my code:

from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)

Here is my output:

[<div class="price">
                        
                            
                            
                                
                                
                                    $46,999.00
                                    
                                    
                                    
                                
                                
                            
                            

                            
                                
                                    <div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>

Ideally, I would only want the output to be "46,999.00".

I tried passing text=True, but that did not work: the only output I got was an empty list.

Thank you

Upvotes: 2

Views: 897

Answers (2)

MendelG

Reputation: 20038

One option that avoids regex is to keep only the tags whose text startswith() a dollar sign ($), then slice that character off:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

price_tags = soup.find_all("div", class_="price")

prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]

print(prices)

Output:

['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
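To see the same filtering without hitting the live site, here is a minimal offline sketch. The SAMPLE_HTML markup is hypothetical, modelled on the div shown in the question, and the "Please Contact" entry is an assumed example of a listing with no numeric price. The helper name extract_prices is also made up for illustration. It additionally converts each string to a Decimal, which is an extra step not in the answer above.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's output; not real page data.
SAMPLE_HTML = """
<div class="price">
    $46,999.00
    <div class="dealer-logo"><div class="dealer-logo-image"><img src="logo.png"/></div></div>
</div>
<div class="price">Please Contact</div>
<div class="price">  $5,500.00  </div>
"""


def extract_prices(html):
    """Return the dollar prices found in div.price tags as Decimal values."""
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    for tag in soup.find_all("div", class_="price"):
        # strip=True trims each text fragment and drops the empty ones.
        text = tag.get_text(strip=True)
        if text.startswith("$"):
            # Drop the "$" and the thousands separators before converting.
            prices.append(Decimal(text[1:].replace(",", "")))
    return prices


print(extract_prices(SAMPLE_HTML))
```

Decimal is used here rather than float only because prices are money; keep the plain strings if that is all you need.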

Upvotes: 1

ATIF ADIB

Reputation: 589

You need to get the text portion of the tag and then clean it up with a regex.

import re

def get_price_from_div(div_item):
    # Keep only digits and the decimal point, e.g. "$46,999.00" -> "46999.00"
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price

Just call this function in your code after you find the divs:

from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
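One caveat with the regex approach: if a div contains no digits at all, the cleaned string is empty and float() raises ValueError. A possible guard is sketched below on plain strings. The function name parse_price and the sample texts (including "Please Contact") are assumptions made for illustration, not taken from the page.

```python
import re


def parse_price(text):
    """Return the price in text as a float, or None if it has no usable number."""
    # Keep only digits and the decimal point.
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        # Raised for an empty string or a malformed number such as "1.2.3".
        return None


# Hypothetical div texts; in the scraper these would come from curr_div.text.
samples = ["\n   $46,999.00\n  ", "Please Contact", "$1,750.00"]
prices = [p for p in (parse_price(s) for s in samples) if p is not None]
print(prices)
```

In the scraper, the same filter could replace the list comprehension passed to prices.extend.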

Upvotes: 3
