Reputation: 119
When I try to scrape an image's source, all I get is a string starting with "data:type" followed by the base64 encoding of the image. All I want is the URL of the image.
I tried adding a condition to skip these values and then extract the URL, but it doesn't work; it just skips the entire image.
Please help.
def get_product_data(url):
    response = http.request("GET", url)
    check_conn_errors(response.status)
    bs_data = Bs(response.data, "html.parser")
    product_html = bs_data.find("div", {"class": PRODUCT_DATA_CLASS_NAME})
    imgs = product_html.find_all("img")
    img_link = ""
    for i in range(len(imgs)):
        if imgs[i]["src"].startswith("/"):
            img_link = PRODUCTS_URL_PREFIX + imgs[i]["src"]
            break
        elif imgs[i]["src"].startswith("http") \
                or imgs[i]["src"].startswith("www") \
                or imgs[i]["src"].startswith(DEALER_NAME.split(' ')[0].lower()):
            img_link = imgs[i]["src"]
            break
    if img_link == "":
        print("this doesn't work")
    # TODO: standardize description scraping
    desc_html = product_html.find_all("div", {"class": DESCRIPTION_CLASS_NAME})
    desc = ""
    for desc_part in desc_html:
        desc += desc_part.text.replace('\n', '\n').replace('\r', '\n').replace('<br/>', '\n').replace('</br>', '\n')
    return [desc, img_link]
Upvotes: 0
Views: 108
Reputation: 195438
To extract the product's main image, you can select the <a> element with class="woocommerce-main-image" and read its href attribute:
import requests
from bs4 import BeautifulSoup
url = "https://sylveco.pl/produkt/sylveco-zestaw-do-pielegnacji-wlosow-niskoporowatych/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one(".woocommerce-main-image")["href"])
Prints:
https://sylveco.pl/wp-content/uploads/2021/07/niskoporowate.jpg
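If you would rather keep looping over the <img> tags, a "data:" value in src usually means the page lazy-loads its images and keeps the real URL in another attribute. Here is a minimal sketch of that idea, not a definitive fix: the attribute names in LAZY_ATTRS, the helper first_image_url, the sample HTML and the example.com URL are all assumptions for illustration, so check the actual markup of your page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed attribute names that lazy-loading scripts often use;
# inspect the real page and adjust this tuple.
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")


def first_image_url(html, base_url):
    """Return the first usable image URL found in html, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # Check src first, then the assumed lazy-loading attributes.
        for attr in ("src",) + LAZY_ATTRS:
            value = img.get(attr)
            # Skip missing attributes and inline base64 placeholders.
            if value and not value.startswith("data:"):
                # urljoin resolves relative paths and leaves absolute URLs unchanged.
                return urljoin(base_url, value)
    return None


# Hypothetical markup: a base64 placeholder in src, the real path in data-src.
sample = (
    '<div><img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" '
    'data-src="/wp-content/uploads/example.jpg"></div>'
)
print(first_image_url(sample, "https://example.com/produkt/"))
```

Using img.get(attr) instead of img[attr] avoids a KeyError on tags without that attribute, and checking the other attributes on the same tag means an image is not skipped just because its src is a placeholder.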
Upvotes: 1