Sahli9876

Reputation: 33

python webscraping results in block

I want to scrape the German real estate website immobilienscout24.de. This is not for commercial use or publication, and I do not intend to spam the site; it is merely for coding practice. I would like to write a Python tool that automatically downloads the HTML of given immobilienscout24.de pages. I have tried BeautifulSoup for this; however, the parsed HTML doesn't show the content but instead asks if I am a robot, meaning my web scraper got detected and blocked (I can access the site in Firefox just fine). I have set a referer, a delay, and a random user agent. What else can I do to avoid being detected (e.g. rotating proxies, random clicks, headless Chrome, this script, other web scraping tools that don't get detected)? Things I found online that might be the reason for the block:

If someone has a working solution with which one can scrape the site, say, 10 times without being blocked, I would be very thankful. Here is my code so far:

from bs4 import BeautifulSoup
import requests
import numpy
import time
from fake_useragent import UserAgent

def get_html(url, headers): #scrapes and parses the HTML of a given URL while using custom header
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

ua = UserAgent()
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "de,de-DE;q=0.8,en;q=0.6", 
    "DNT": "1", 
    "Host": "www.immobilienscout24.de",  # Host is a bare hostname, not a URL
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": ua.random, 
  }
delays = [3, 5, 7, 4, 4, 11]
time.sleep(numpy.random.choice(delays))
test = get_html("https://www.immobilienscout24.de/Suche/de/baden-wuerttemberg/heidelberg/wohnung-kaufen?enteredFrom=one_step_search", headers)
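For what it's worth, the header and random-delay ideas from the code above can be factored into two small helpers. The function names and the hard-coded Firefox user agent string are mine, not from the question; this is just a sketch of the same approach without the network call:

```python
import random
import time

def build_headers(user_agent):
    # Hypothetical helper: assemble the same kind of header set as above
    # around a given User-Agent string.
    return {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "de,de-DE;q=0.8,en;q=0.6",
        "Referer": "https://www.immobilienscout24.de/",
        "User-Agent": user_agent,
    }

def polite_delay(low=3.0, high=11.0):
    # Hypothetical helper: sleep for a random amount of time within the given
    # bounds, so requests are not evenly spaced. Returns the delay used.
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

headers = build_headers(
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)
```

Drawing the delay from a continuous range instead of a fixed list of six values makes the request timing a little less regular, though by itself this will not defeat bot detection.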

Upvotes: 1

Views: 319

Answers (1)

alex hernandez

Reputation: 136

This code still needs more work, but my guess is that plain requests doesn't work because the page needs to run JavaScript. If you use something like Selenium it should work, because it can run JS code. requests_html also bundles pyppeteer, which is similar to Selenium.


from requests_html import HTMLSession

# create the session
session = HTMLSession()

#define our URL
url = 'https://www.immobilienscout24.de/Suche/de/baden-wuerttemberg/heidelberg/wohnung-kaufen'

#use the session to get the data
r = session.get(url)

# Render the page; raise scrolldown to page down multiple times on a page.
# keep_page=True keeps the underlying pyppeteer page around for the screenshot.
r.html.render(sleep=1, timeout=30, keep_page=True, scrolldown=1)

# page.screenshot is a coroutine, so it has to be driven by the event loop;
# its options dict takes a 'path' key pointing at the output file
import asyncio
asyncio.get_event_loop().run_until_complete(
    r.html.page.screenshot({'path': 'C:/Users/program/Desktop/help/example.png'})
)

print(r.html.html)  # the rendered HTML (r.text is the pre-render response body)
session.close()
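The Selenium route mentioned in the answer could be sketched roughly like this. This is untested against the site; the Chrome flags are common choices for headless scraping (not guaranteed to avoid detection), the function names are mine, and a matching chromedriver must be installed for the fetch to actually run:

```python
def headless_chrome_args():
    # Chrome flags often used for headless scraping with Selenium
    # (an assumption of this sketch, not taken from the answer above)
    return [
        "--headless=new",
        "--disable-blink-features=AutomationControlled",
        "--window-size=1920,1080",
    ]

def fetch_with_selenium(url):
    # Imported inside the function so the rest of this sketch works
    # even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in headless_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # Chrome executes the page's JavaScript
        return driver.page_source
    finally:
        driver.quit()

# usage (hits the network, needs chromedriver):
# html = fetch_with_selenium(
#     "https://www.immobilienscout24.de/Suche/de/baden-wuerttemberg/heidelberg/wohnung-kaufen"
# )
```

Because a real browser runs the page's JavaScript, the returned `page_source` is the rendered DOM rather than the raw response body that requests sees.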

Upvotes: 1
