Reputation: 711
I would like to scrape some ads for personal use from mobile.de.
I am using Python 3.6 with the requests library, but I am running into some kind of bot detection. How can I get past this check on their website?
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.mobile.de/?lang=en")
bs = BeautifulSoup(r.content, 'lxml')
print(bs)
This code prints the following:
<p>To continue your browser has to accept cookies and has to have JavaScript enabled.</p>
Where can I find the logic I need to work through in order to pass this check?
Upvotes: 11
Views: 34724
Reputation: 3107
The reason you got unexpected content is that you are not sending a valid header, just as @afit said. But the message To continue your browser has to accept cookies and has to have JavaScript enabled. also makes sense: if JavaScript is not enabled, the full content will not load.

Note: I recommend you use selenium for this. requests_html cannot access the website successfully due to the lack of a suitable header while it renders. By the way, if you want to follow URLs generated inside the JavaScript and grab that content, it will be a tough job.
from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real Chrome browser, so cookies and JavaScript both work.
dr = webdriver.Chrome()
dr.get("https://www.mobile.de/?lang=en")
bs = BeautifulSoup(dr.page_source, "lxml")
Upvotes: 8
Reputation: 2035
They could be doing this a number of different ways, ranging from trivial to tricky to bypass at scale. One approach would be to modify your User-Agent, as their simplest approach would be to deny requests based on that.
r = requests.get(
    'https://yoursite.com',
    headers={
        'User-Agent': 'Popular browser\'s user-agent',
    },
)
It doesn't look like it from the example URL you show, but they could be expecting that URL to be hit after another page on the site that drops a cookie. If that's the case, make the earlier request first and provide the resulting cookie in your requests call.
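A minimal sketch of that cookie flow using requests.Session, which persists cookies between calls automatically; the cookie name and User-Agent string below are made-up placeholders, not values taken from the site:

```python
import requests

# A Session keeps cookies between requests, so whatever the first page
# sets is sent automatically with every later call.
session = requests.Session()
session.headers.update({
    # Hypothetical desktop-browser User-Agent string.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

# First hit the page that drops the cookie (network call, shown disabled):
# session.get("https://www.mobile.de/")

# You can also set a known cookie by hand; this name/value pair is made up:
session.cookies.set("gdpr-consent", "accepted")

# Subsequent requests through this session now carry the cookie:
# r = session.get("https://www.mobile.de/?lang=en")
```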
Upvotes: 5