Adlet Dairbekov

Reputation: 29

Beautiful Soup web scraping: getting product links

I am trying to get each product's name and price from a local website, and I am using Beautiful Soup for this. My code:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.mechta.kz'  # assumed value; baseurl was not defined in the original snippet
productlinks = []

for x in range(1, 3):  # result pages 1 and 2
    r = requests.get(f'https://www.mechta.kz/section/stiralnye-mashiny/?arrFilter5_pf%5BNEW%5D=&arrFilter5_pf%5BARFP%5D=43843%2C43848&arrFilter5_pf%5BPROMOCODE_PROCENT%5D%5BLEFT%5D=&arrFilter5_pf%5BPROMOCODE_PROCENT%5D%5BRIGHT%5D=&arrFilter5_pf%5BMINPRICE_s1%5D%5BLEFT%5D=38990&arrFilter5_pf%5BMINPRICE_s1%5D%5BRIGHT%5D=1171000&set_filter=Y&PAGEN_2={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('div', class_='aa_st_img iprel')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(baseurl + link['href'])

The code runs, but it does not return every product on the site: some products are skipped, and their links never appear in productlinks.

Could you please suggest a solution to this problem?

Thanks!

Upvotes: 0

Views: 295

Answers (2)

Paktas

Reputation: 318

You can try other ways of sourcing product URLs, as shown in the diagram below. In your specific case, Mechta has a sitemap index: fetch the sitemaps it lists and parse the XML.

[Image: product URL sourcing options for scraping]
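
As a rough illustration of the sitemap route, here is a minimal sketch that pulls every `<loc>` URL out of a sitemap or sitemap index using only the standard library. The function name `extract_locs` and the sample XML are made up for this example; the actual sitemap URLs and their contents on Mechta were not checked here.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_data):
    """Return every <loc> URL found in a sitemap or sitemap index.

    Accepts str or bytes, so a response body can be passed in directly.
    """
    root = ET.fromstring(xml_data)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample data standing in for a real sitemap index.
SAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products-2.xml</loc></sitemap>
</sitemapindex>
""".strip()

child_sitemaps = extract_locs(SAMPLE_INDEX)
```

With requests, the same function could be called as `extract_locs(requests.get(index_url).content)`, where `index_url` is the address of the site's sitemap index (often listed in robots.txt). Each child sitemap can then be fetched and passed through `extract_locs` again to collect the product URLs.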

Upvotes: 0

Denis Tsoi

Reputation: 10414

Judging by the page at that link, every product link carries the class j_product_link, so we can find all a tags with that class.

For example:

soup.find_all('a', class_='j_product_link')
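
One possible reason the original selector skips products: when `class_` is given a string containing several class names, Beautiful Soup compares it against the exact value of the class attribute, so a div whose classes are in a different order or that carries an extra class is not matched. The sketch below shows this on made-up markup; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
HTML = """
<div class="aa_st_img iprel"><a class="j_product_link" href="/product/a/">A</a></div>
<div class="iprel aa_st_img"><a class="j_product_link" href="/product/b/">B</a></div>
<div class="aa_st_img iprel promo"><a class="j_product_link" href="/product/c/">C</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A multi-class string is matched against the exact class attribute value,
# so only the first div qualifies.
exact_divs = soup.find_all("div", class_="aa_st_img iprel")
exact_hrefs = [a["href"] for div in exact_divs for a in div.find_all("a", href=True)]

# A single class name matches every tag that carries that class.
link_hrefs = [a["href"] for a in soup.find_all("a", class_="j_product_link")]
```

Here `exact_hrefs` holds only the first link, while `link_hrefs` holds all three.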

A possible solution:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.mechta.kz'  # assumed value; baseurl is not defined in the question
productlinks = []
for x in range(1, 3):  # result pages 1 and 2
    r = requests.get(f'https://www.mechta.kz/section/stiralnye-mashiny/?arrFilter5_pf%5BNEW%5D=&arrFilter5_pf%5BARFP%5D=43843%2C43848&arrFilter5_pf%5BPROMOCODE_PROCENT%5D%5BLEFT%5D=&arrFilter5_pf%5BPROMOCODE_PROCENT%5D%5BRIGHT%5D=&arrFilter5_pf%5BMINPRICE_s1%5D%5BLEFT%5D=38990&arrFilter5_pf%5BMINPRICE_s1%5D%5BRIGHT%5D=1171000&set_filter=Y&PAGEN_2={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('a', class_='j_product_link')  # select the product anchors directly
    for link in productlist:
        productlinks.append(baseurl + link['href'])
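
If the same product link can appear more than once per card, or if some href values are already absolute, plain concatenation with `baseurl + link['href']` may produce duplicates or malformed URLs. A small sketch of one way to guard against both, using `urllib.parse.urljoin`; the helper name `normalize_links` and the sample hrefs are invented for this example.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Resolve each href against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict preserves insertion order, so this de-duplicates without reordering.
    return list(dict.fromkeys(absolute))


BASE_URL = "https://www.mechta.kz"

# Made-up hrefs: one relative, one already absolute, one more relative.
sample_hrefs = ["/product/a/", "https://www.mechta.kz/product/a/", "/product/b/"]
links = normalize_links(BASE_URL, sample_hrefs)
```

In the loop above, this would mean collecting `link['href']` values first and calling `normalize_links(baseurl, hrefs)` once at the end.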

Upvotes: 1
