user13491496

Web Scraping missing values

I have this simple scraping code.

The problem is that it works on some pages but not on others, even though the pages have the same kind of content. Why is that?

To be scraped

I have an image here. The underlined part is the value I am trying to scrape, and next to it is the corresponding HTML. Note that the scraper works on pages with the same content and the same HTML structure as in the image, but on some pages it does not. I noticed this when I printed the block of HTML where that span should be and searched the output in Sublime Text: the span was not there. So on some pages that block is missing from the parsed output, while on others it is present.

I hope that is clear. Here is the source code. Try it and you will see what I mean.

from bs4 import BeautifulSoup
import requests

url2 = "https://www.ebay.com.au/itm/Auth-Bell-Ross-Black-PVD-Mens-Wrist-Watch-46MM-BR01-92-S-Box-Docs/383600988865?hash=item59506692c1:g:-0wAAOSw1BBeVzyE"
url = "https://www.ebay.com.au/itm/Bell-Ross-BR03-94-Ceramic-Desert-Type-Chronograph-Automatic-Watch/174344419661?hash=item2897bcb94d:g:7M4AAOSwb~Ve7y0M"
rawdata = requests.get(url)  # this page gives None for location and postage; url2 works
soup = BeautifulSoup(rawdata.content, "html.parser")  # try xml parser


product_block = soup.find("div",{"id":"CenterPanelInternal"})

#print(product_block)

product_name = product_block.find("h1",class_="it-ttl").text
product_condition = product_block.find("div",class_="condText").text
product_price = product_block.find("span",class_="notranslate").text
product_seller = product_block.find("div",class_="bdg-90").text.replace("\n",'')
product_loc = product_block.find("div",class_="iti-eu-bld-gry")#.text.replace("\n",'')
product_postTo = product_block.find("div",class_="vi-shp-pdg-rt")#.text.replace("\n",'')
product_img = product_block.find("img",class_="img").get("src")

print(product_name)
print(product_condition)
print(product_price)
print(product_seller)
print(product_loc)
print(product_postTo)
print(product_img)

After running the code, this is the result. The location and postage values are None because that block of HTML is not in the parsed output. (Screenshot: None result)
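One way to test the assumption that the block "does not exist" is to check the raw response text and the parsed tree separately. The sketch below is only an illustration: `where_is_it` and `sample_html` are hypothetical names, and the inline string stands in for `rawdata.text` so the example runs without hitting the live page.

```python
from bs4 import BeautifulSoup


def where_is_it(html, css_class, parser="html.parser"):
    """Report whether css_class appears in the raw markup and in the parsed tree."""
    soup = BeautifulSoup(html, parser)
    return {
        "in_raw_html": css_class in html,  # plain substring check on the text
        "in_parsed_tree": soup.find(class_=css_class) is not None,
    }


# Hypothetical stand-in for rawdata.text; the real page is far larger.
sample_html = (
    '<div id="CenterPanelInternal">'
    '<div class="iti-eu-bld-gry">Melbourne</div>'
    "</div>"
)

report = where_is_it(sample_html, "iti-eu-bld-gry")
missing = where_is_it(sample_html, "vi-shp-pdg-rt")
print(report)
print(missing)
```

If a class shows up in the raw HTML but not in the parsed tree, the parser is the likely suspect. If it is absent from both, the server probably did not send that markup in this response.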

After changing the URL to url2, which is a different page with different data but the same kind of content and the same classes and ids in the HTML, I get this result. (Screenshot: Correct)

This is so weird, to be honest. Please help me :( Am I missing something? Is there something I did not understand? Please let me know. You can copy the links from the code, by the way :) Thank you so much!

Upvotes: 2

Views: 755

Answers (1)

Andrej Kesely

Reputation: 195418

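Change the parser to html5lib (it is a separate package: pip install html5lib). Beautiful Soup's parsers can build different trees from the same imperfect markup, so an element that goes missing with html.parser may still be found with html5lib: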

import requests
from bs4 import BeautifulSoup


url2 = "https://www.ebay.com.au/itm/Bell-Ross-BR03-94-Ceramic-Desert-Type-Chronograph-Automatic-Watch/174344419661?hash=item2897bcb94d:g:7M4AAOSwb~Ve7y0M"

rawdata = requests.get(url2)
soup = BeautifulSoup(rawdata.content, "html5lib")  # <--- change to "html5lib" here


product_block = soup.find("div",{"id":"CenterPanelInternal"})

#print(product_block)

product_name = product_block.find("h1",class_="it-ttl").text
product_condition = product_block.find("div",class_="condText").text
product_price = product_block.find("span",class_="notranslate").text
product_seller = product_block.find("div",class_="bdg-90").text.replace("\n",'')
product_loc = product_block.find("div",class_="iti-eu-bld-gry")#.text.replace("\n",'')
product_postTo = product_block.find("div",class_="vi-shp-pdg-rt")#.text.replace("\n",'')
product_img = product_block.find("img",class_="img").get("src")

print(product_name)
print(product_condition)
print(product_price)
print(product_seller)
print('-' * 80)
print(product_loc)
print('-' * 80)
print(product_postTo)
print('-' * 80)
print(product_img)

Prints:

...

--------------------------------------------------------------------------------
<div class="iti-eu-bld-gry">
            <span itemprop="availableAtOrFrom">Melbourne, Victoria, Australia</span>
        </div>
--------------------------------------------------------------------------------
<div class="iti-eu-bld-gry vi-shp-pdg-rt" id="vi-acc-shpsToLbl-cnt">
            <span itemprop="areaServed">
            Worldwide</span>
        </div>
--------------------------------------------------------------------------------

...
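To see how much the choice of parser can matter, here is a minimal sketch that feeds one deliberately broken snippet to each parser and prints the tree it builds. The helper name `parse_with` is hypothetical, and the snippet `<a></p>` is a toy example (the same one the Beautiful Soup documentation uses), not eBay's markup. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return the resulting trees as strings."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            results[name] = None
    return results


# Invalid on purpose: a stray closing </p> inside an <a> tag.
trees = parse_with("<a></p>")
for name, tree in trees.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

html.parser makes no attempt to repair the document, while html5lib rebuilds it the way a browser would. On a large, imperfect real-world page that difference can plausibly decide whether a nested div ends up where find() looks for it.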

Upvotes: 1
