Reputation: 467
I'm very new to web scraping. I know nothing about cookies, which seem to be the problem here. I'm trying something very simple, i.e. doing a requests.get() on some website, then playing with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20")
print(page)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This basically doesn't work, as the print(soup.prettify()) says: "Request unsuccessful. Incapsula incident ID: 449001030063484539-234265426366891642"
That's ok, I found out that it's because my get() needs some cookies. So I used the method described here: I created a dict of the cookies and passed it as an argument to my get():
cookies = {'incap_ses_449_150286':'ll/1bp9r6ifi7LPUDiw7Bi/dzlwAAAAAO6OR80W3VDDesKNGYZv4PA==', 'visid_incap_150286':'+Tg7VstMS1OzBycT4432Ey/dzlwAAAAAQUIPAAAAAAAqAettOJXSb8ocwxkzabRx'}
page = requests.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20", cookies=cookies)
...and now the print(soup.prettify()) prints the whole page, ok.
But if I shut down my computer, come back the next day, and run my script again, the cookies I hardcoded no longer work, because their values have actually changed, right? That's what I observe: just re-running my script doesn't work anymore. I guess it's normal cookie behavior for the values to change from one day to the next (?).
So I thought I might fetch these cookies automatically before doing my requests.get(). So I did this:
session = requests.Session()
response = session.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20")
cookies = session.cookies.get_dict()
When doing this, I do get 2 cookies (the 'incap_ses_449_150286' one and the other), but with different values than what I see in Chrome's developer tools on the web page. And passing these cookies to my get() doesn't seem to work: I no longer get the "Request unsuccessful" message, but the print(soup.prettify()) prints close to nothing. The only way I can get it working correctly is by manually putting the cookies in the dict, after looking them up with Chrome's tools... What am I missing?
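For reference, the usual way to avoid hardcoding cookie values is to route every request through the same Session object: its cookie jar stores whatever Set-Cookie headers the server sends and replays them automatically on later requests. A minimal sketch against the question's URL (whether Incapsula's bot detection then accepts the request is a separate matter):

```python
import requests

SEARCH_URL = ("https://www.immoweb.be/fr/recherche/maison/a-vendre/"
              "brabant-wallon?minprice=100000&maxprice=200000"
              "&minroom=3&maxroom=20")

def fetch_with_session(url):
    """Fetch url twice through one Session: cookies set by the first
    response (e.g. Incapsula's incap_ses_* / visid_incap_* values) land
    in session.cookies and are sent back automatically the second time."""
    session = requests.Session()
    session.get(url)  # first request: the server sets its cookies here
    return session.get(url), session.cookies.get_dict()

# Usage (requires network access):
# page, cookies = fetch_with_session(SEARCH_URL)
# print(cookies)  # the live incap_ses_* / visid_incap_* values
```

The key point is that the cookies never need to be copied by hand: the jar is filled by the server's own responses, so the values are always current.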
Thanks a lot! Arnaud
Upvotes: 0
Views: 784
Reputation: 6426
This isn't a Python issue. The web server you're connecting to is very particular about what it lets access its site. Something differs between your web browser and requests that the server detects, causing it to allow one and deny the other. The cookies are probably there so it doesn't have to keep repeating that detection (the error message mentions Incapsula, a bot-protection service), and by copying the cookies from Chrome to requests you're circumventing it.
Have you tried setting the user agent to Chrome's? Also, check the site's robots.txt to see whether it allows web scrapers; it may be that the website owners don't want you doing this, and it seems they've already put measures in place to prevent it.
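To try both suggestions concretely: send a browser-like User-Agent header with the request, and consult robots.txt before scraping. The User-Agent string below is just an example Chrome string, not anything specific to this site; a sketch:

```python
import urllib.robotparser

import requests

# Example Chrome-like User-Agent string; any recent browser UA works.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/74.0.3729.169 Safari/537.36")
}

def fetch_as_browser(url):
    """Fetch url with a browser-like User-Agent instead of requests'
    default 'python-requests/x.y.z', which detection services often flag."""
    return requests.get(url, headers=HEADERS)

def allowed_by_robots(robots_lines, user_agent, page_url):
    """Check robots.txt rules (given as a list of lines) for permission
    to fetch page_url. For a live site you would instead call
    rp.set_url("https://<site>/robots.txt") followed by rp.read()."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, page_url)

# Offline demonstration with a made-up robots.txt:
sample = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(sample, HEADERS["User-Agent"],
                        "https://example.com/fr/recherche/"))  # True
```

If robots.txt disallows the path, the polite option is simply not to scrape it; if it's allowed, the custom header at least makes the request look less like a default script.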
Upvotes: 1