Gian Franco Tan

Reputation: 55

Not getting data from a retailer's website using Python BeautifulSoup

I'm trying to scrape the price from a particular website. The page I've been practicing on is https://www.harveynorman.com.au/asus-f402wa-ga019t-14-inch-laptop.html

import json
import requests

session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
jar.set('incap_ses_572_39856', 'wuEvYO64IwcG0nzjJijwB+oi3FwAAAAA0mUuBJjlb55z2q8aD0K/Ug==; SLIBeacon=5cdc22e9ece4f; SLIUserID=168578381; __utma=137779881.1422157795.1557930730.1557930730.1557930730.1; __utmc=137779881; __utmz=137779881.1557930730.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; _gcl_au=1.1.866045692.1557930730; _ga=GA1.3.1422157795.1557930730; _gid=GA1.3.1396003810.1557930731; _caid=067e96e9-dfea-4ff0-bd40-7729d204dc3c; _cavisit=16abbe8672e|; gdprContinent=NOT-EU; SLIBeacon=5cdc22e9ece4f; _fbp=fb.2.1557930734066.1140960424; _hjIncludedInSample=1; inptime0_3986_au=0; com.silverpop.iMAWebCookie=5621042d-8a53-3d48-d144-beb9db181190; com.silverpop.iMA.session=83ab8550-a067-6e50-7239-411cde0ad75d; com.silverpop.iMA.page_visit=-303946284:; reloadLists=true; inpsession_3986_au=03BA299D-6307-61F5-DD5E-F3F561CCA385; __gads=ID=d4a1dce2efb966ac:T=1557930751:S=ALNI_MaarXiiUHzcInDtMvu3BU8YWN9ziw; LPVID=FhMTIwOTc4YzY5N2VjNDhl; LPSID-58902652=tfROAwmpTgu9u-avZulSqg; inptime_3986_au=120; __utmb=137779881.2.10.1557930730; _gat_UA-5631569-15=1; _gat_UA-5631569-18=1')

session.cookies = jar
r = session.get('https://www.harveynorman.com.au/applybuy/apply/product?id=283011&price=297&_=1557930879834')

print(r.text)

I expected to get either JSON data I could use or the full HTML. Unfortunately, even with the cookies set, I don't get any useful data. The result was:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>

I need help handling this kind of issue without using Selenium or Scrapy. Thanks!

Upvotes: 0

Views: 301

Answers (1)

Dhamodharan

Reputation: 309

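This site is protected by an anti-bot service delivered through its CDN. Incapsula is one of the major anti-bot networks on the market. It uses advanced ML-based algorithms to decide whether a visitor is a bot or a human, based on many parameters including browser fingerprinting. The near-empty page you received, which only loads a script from /_Incapsula_Resource, is its challenge page rather than the product data.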
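To tell a blocked response from a real one in code, a simple heuristic is to look for the challenge script path seen in the output above. This is a minimal sketch; the function name is made up for illustration, and the marker string is taken only from the response shown in the question, so other Incapsula responses may look different.

```python
def looks_like_incapsula_challenge(html: str) -> bool:
    """Heuristic check: the blocked page in the question contains little
    more than a <script> tag pointing at /_Incapsula_Resource."""
    return "_Incapsula_Resource" in html


# Example with the kind of response shown in the question
blocked = (
    '<html><head><META NAME="robots" CONTENT="noindex,nofollow">'
    '<script src="/_Incapsula_Resource?SWJIYLWA=..."></script>'
    "<body></body></html>"
)
print(looks_like_incapsula_challenge(blocked))
```

Checking for this before parsing avoids handing the challenge page to BeautifulSoup and then wondering why the price element is missing.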

There are a few approaches you can try with a site like this:

  1. Send proper headers that impersonate a real browser
  2. Use premium proxies (i.e. residential proxies) such as Microleaves or Crawlera
  3. Leave reasonable time intervals between requests and rotate proxies
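The three points above can be sketched roughly as follows. This is only an illustration, not a guaranteed bypass: the header values are example values (copy the real ones from your own browser's developer tools), the helper name `fetch_politely` and its parameters are made up here, and the proxy URLs would come from whichever provider you use. The `session` argument is expected to be a `requests.Session`.

```python
import itertools
import random
import time

# Assumed example values; replace them with what your own browser sends.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-AU,en;q=0.9",
}


def fetch_politely(session, urls, proxies=None,
                   min_delay=2.0, max_delay=6.0, sleep=time.sleep):
    """Fetch each URL with browser-like headers, a random pause between
    requests, and (optionally) a proxy taken in rotation from `proxies`."""
    proxy_pool = itertools.cycle(proxies) if proxies else None
    responses = []
    for index, url in enumerate(urls):
        if index:  # pause between requests, not before the first one
            sleep(random.uniform(min_delay, max_delay))
        kwargs = {"headers": BROWSER_HEADERS, "timeout": 30}
        if proxy_pool is not None:
            proxy = next(proxy_pool)
            kwargs["proxies"] = {"http": proxy, "https": proxy}
        responses.append(session.get(url, **kwargs))
    return responses


# Usage (requires the requests package):
#   import requests
#   session = requests.Session()
#   pages = fetch_politely(session, ["https://www.harveynorman.com.au/..."])
```

Headers alone may not be enough, since the fingerprinting mentioned above also relies on JavaScript running in the browser, so treat this as a starting point.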

There are even some ready-made libraries developed to get past this firewall, such as incapsula-cracker-py3.

Whichever approach you try, your code should behave as much like a real human visitor as possible.
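One related detail in the question's code: `jar.set(name, value)` stores a single cookie, so passing the whole browser cookie string as the value creates one cookie named `incap_ses_572_39856` whose value is everything else. If you want to replay cookies copied from a browser, split the string into separate cookies first. The helper below is a hypothetical sketch of that step; copied cookies can expire or be tied to the original browser session, so this alone may still not get past Incapsula.

```python
def parse_cookie_header(header: str) -> dict:
    """Split a raw 'name=value; name2=value2' cookie string into a dict.

    Only the first '=' in each pair separates name from value, so values
    that contain '=' themselves (e.g. base64 padding) are kept intact.
    """
    cookies = {}
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Usage (requires the requests package):
#   import requests
#   session = requests.Session()
#   session.cookies.update(parse_cookie_header(raw_cookie_string))
```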

Upvotes: 1
