Hellmick

Reputation: 119

Problem with image source scraping in loop using Python BS4

When I try to scrape the image source, all I get is a string starting with "data:type" followed by the base64 encoding of the image. All I want is the URL of the image.

I tried adding a condition to skip these values and then extract the URL, but it doesn't work; it just skips the entire image.

Please help.

def get_product_data(url):
    response = http.request("GET", url)
    check_conn_errors(response.status)
    bs_data = Bs(response.data, "html.parser")
    product_html = bs_data.find("div", {"class": PRODUCT_DATA_CLASS_NAME})
    imgs = product_html.find_all("img")
    img_link = ""

    for img in imgs:
        src = img["src"]
        if src.startswith("/"):
            img_link = PRODUCTS_URL_PREFIX + src
            break
        elif src.startswith(("http", "www", DEALER_NAME.split(' ')[0].lower())):
            img_link = src
            break

    if img_link == "":
        print("this doesn't work")

#TODO: standardize description scraping

    desc_html = product_html.find_all("div", {"class": DESCRIPTION_CLASS_NAME})
    desc = ""

    for desc_part in desc_html:
        desc += desc_part.text.replace('\n', '&#xD;').replace('\r', '&#xD;').replace('<br/>', '&#xD;').replace('</br>', '&#xD;')
    return [desc, img_link]

Upvotes: 0

Views: 108

Answers (1)

Andrej Kesely

Reputation: 195438

To extract the product's main image, you can select the <a> element with class="woocommerce-main-image":

import requests
from bs4 import BeautifulSoup


url = "https://sylveco.pl/produkt/sylveco-zestaw-do-pielegnacji-wlosow-niskoporowatych/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(soup.select_one(".woocommerce-main-image")["href"])

Prints:

https://sylveco.pl/wp-content/uploads/2021/07/niskoporowate.jpg
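
If you would rather keep the loop over the <img> tags, one common cause of "data:" values in src is lazy loading: the page puts a placeholder in src and keeps the real URL in another attribute. Below is a minimal sketch under that assumption. The attribute names in LAZY_ATTRS are common conventions, not something I have confirmed for this site, and pick_image_url and first_image_url are hypothetical helper names. The demo uses plain dicts so it runs without a network call; a BeautifulSoup Tag offers the same .get() method, so the result of product_html.find_all("img") should work in their place.

```python
from urllib.parse import urljoin

# Assumption: these are common lazy-loading attribute names, not ones
# confirmed for the site in the question. Inspect the HTML and adjust.
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")


def pick_image_url(img, base_url):
    """Return the first usable URL from an <img>-like object, or None.

    `img` only needs a dict-style .get(), which both a plain dict and a
    BeautifulSoup Tag provide.
    """
    for attr in LAZY_ATTRS + ("src",):
        value = img.get(attr)
        # Skip missing attributes and inline base64 placeholders.
        if value and not value.startswith("data:"):
            # urljoin keeps absolute URLs as-is and resolves relative ones.
            return urljoin(base_url, value)
    return None


def first_image_url(imgs, base_url):
    """Return the URL of the first image that has one, or "" if none do."""
    for img in imgs:
        url = pick_image_url(img, base_url)
        if url:
            return url
    return ""


# Stand-ins for <img> tags: a placeholder-only image, then a lazy-loaded one.
imgs = [
    {"src": "data:image/gif;base64,R0lGODlhAQABAAAAACw="},
    {
        "src": "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
        "data-src": "/wp-content/uploads/photo.jpg",
    },
]
print(first_image_url(imgs, "https://example.com/shop/"))
```

With this approach an image is skipped only when none of its attributes holds a real URL, instead of being dropped as soon as src turns out to be a placeholder.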

Upvotes: 1
