Not getting data from a retailer's website using Python BeautifulSoup

Question

I'm trying to scrape the price from a particular website. The one I've been practicing on is https://www.harveynorman.com.au/asus-f402wa-ga019t-14-inch-laptop.html

import json
import requests

session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
jar.set('incap_ses_572_39856', 'wuEvYO64IwcG0nzjJijwB+oi3FwAAAAA0mUuBJjlb55z2q8aD0K/Ug==; SLIBeacon=5cdc22e9ece4f; SLIUserID=168578381; __utma=137779881.1422157795.1557930730.1557930730.1557930730.1; __utmc=137779881; __utmz=137779881.1557930730.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; _gcl_au=1.1.866045692.1557930730; _ga=GA1.3.1422157795.1557930730; _gid=GA1.3.1396003810.1557930731; _caid=067e96e9-dfea-4ff0-bd40-7729d204dc3c; _cavisit=16abbe8672e|; gdprContinent=NOT-EU; SLIBeacon=5cdc22e9ece4f; _fbp=fb.2.1557930734066.1140960424; _hjIncludedInSample=1; inptime0_3986_au=0; com.silverpop.iMAWebCookie=5621042d-8a53-3d48-d144-beb9db181190; com.silverpop.iMA.session=83ab8550-a067-6e50-7239-411cde0ad75d; com.silverpop.iMA.page_visit=-303946284:; reloadLists=true; inpsession_3986_au=03BA299D-6307-61F5-DD5E-F3F561CCA385; __gads=ID=d4a1dce2efb966ac:T=1557930751:S=ALNI_MaarXiiUHzcInDtMvu3BU8YWN9ziw; LPVID=FhMTIwOTc4YzY5N2VjNDhl; LPSID-58902652=tfROAwmpTgu9u-avZulSqg; inptime_3986_au=120; __utmb=137779881.2.10.1557930730; _gat_UA-5631569-15=1; _gat_UA-5631569-18=1')

session.cookies = jar
r = session.get('https://www.harveynorman.com.au/applybuy/apply/product?id=283011&price=297&_=1557930879834')

print(r.text)

I expected to get either JSON data I could use or the whole HTML. Unfortunately, even with the cookies set, I don't get any usable data. The result was:

I need help solving this kind of issue without using Selenium or Scrapy. Thanks!

Dhamodharan · Accepted Answer

This site is protected by an anti-bot service delivered through its CDN. Incapsula is one of the major anti-bot networks on the market. It uses advanced ML-based algorithms to decide whether a visitor is a bot or a human, based on many parameters, including browser fingerprinting.
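To confirm that the empty or unusable response really comes from the anti-bot layer, you can check the body for typical challenge-page markers. This is a minimal sketch: the marker strings and the helper name `looks_like_incapsula_block` are assumptions based on commonly reported Incapsula block pages, not something taken from the response in the question.

```python
# Assumed marker strings; adjust them to what the site actually returns.
INCAPSULA_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the body contains any of the assumed challenge-page markers."""
    return any(marker in html for marker in INCAPSULA_MARKERS)


# Illustrative bodies only, not real responses from the site.
blocked = (
    '<html><body><iframe src="/_Incapsula_Resource?CWUDNSAI=9"></iframe>'
    "Request unsuccessful. Incapsula incident ID: 0-0</body></html>"
)
normal = '<html><body><span class="price">$297</span></body></html>'

print(looks_like_incapsula_block(blocked))
print(looks_like_incapsula_block(normal))
```

In the question's script you would call it as `looks_like_incapsula_block(r.text)` before trying to parse anything.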

There are a few approaches you can try on a site like this:

Use proper headers that impersonate a browser
Use premium proxies (i.e. residential proxies) such as Microleaves or Crawlera
Use proper time intervals between requests and rotate proxies
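The three points above can be combined roughly as follows with `requests`. This is only a sketch under assumptions: the header values, the proxy URLs, the delay range, and the helper names `build_session` and `fetch_all` are all hypothetical placeholders. It may still be blocked, because Incapsula also fingerprints the browser.

```python
import itertools
import random
import time

import requests

# Hypothetical values: copy real headers from your own browser and use your own proxies.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-AU,en;q=0.9",
}
PROXIES = ["http://proxy1.example.com:8000", "http://proxy2.example.com:8000"]


def build_session(headers=BROWSER_HEADERS):
    """Create a session that sends browser-like headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_all(urls, session, proxies=PROXIES, min_delay=2.0, max_delay=6.0):
    """Fetch each URL through the next proxy in turn, pausing a random interval after each."""
    proxy_cycle = itertools.cycle(proxies)
    pages = {}
    for url in urls:
        proxy = next(proxy_cycle)
        response = session.get(
            url, proxies={"http": proxy, "https": proxy}, timeout=30
        )
        pages[url] = response.text
        # Random pause so the requests do not arrive at a fixed, bot-like rate.
        time.sleep(random.uniform(min_delay, max_delay))
    return pages
```

Usage would look like `pages = fetch_all([product_url], build_session())`, where `product_url` is the page you want to scrape.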

There are also ready-made libraries developed to bypass this firewall, such as incapsula-cracker-py3.

Whatever you try, the code should impersonate an actual human.
