ge0rg

Reputation: 73

Getting HTTP Error 500 when opening a URL in Python with various libraries (urllib, requests, curl_cffi)

As the title says, I can't manage to get the HTML from the page below, or from any listing at https://stadtundland.de/wohnungssuche?district=all for that matter. I want to scrape some text from the listing with bs4 (and eventually automate the form at the bottom of the page). The page opens fine in a regular browser.

I have:

url = "https://stadtundland.de/wohnungssuche/1001%2F5156%2F00130"

(if this listing has expired by the time you read this, you can replace it with any listing from https://stadtundland.de/wohnungssuche?district=all)

from urllib.request import urlopen, Request

urlopen(url)

results in HTTPError: HTTP Error 500: Internal Server Error

req = Request(url)
urlopen(req)

also results in HTTPError: HTTP Error 500: Internal Server Error
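In case it helps diagnose this: instead of letting the HTTPError propagate, its body can be read, since the server's 500 page sometimes hints at the cause (e.g. a bot-protection message). A sketch of that (the fetch helper name is just something I made up):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def fetch(url, headers=None):
    """Return (status, body). On HTTPError, return the error status and
    body instead of raising, so the server's error page can be inspected."""
    req = Request(url, headers=headers or {})
    try:
        with urlopen(req) as resp:
            return resp.status, resp.read()
    except HTTPError as e:
        # HTTPError is itself a file-like response; its body is readable
        return e.code, e.read()
```

With `status, body = fetch(url)`, printing `body[:500]` shows whatever error page the server actually sent along with the 500.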

import requests
requests.get(url)

yields: ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
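From what I can tell, that ContentDecodingError means the body is not actually gzip even though the response claims Content-Encoding: gzip — i.e. the server is most likely sending a plain error page with the header still set. The same zlib failure can be reproduced offline:

```python
import zlib

# A body that claims to be gzip but isn't (e.g. an HTML error page).
body = b"<html>Internal Server Error</html>"
try:
    # wbits = 16 + MAX_WBITS selects gzip mode, which is how urllib3
    # decodes a Content-Encoding: gzip response for requests.
    zlib.decompress(body, 16 + zlib.MAX_WBITS)
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect header check
```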

Because of this I thought that adding some headers might help, but I am quite new to this and I'm not sure if I am doing this correctly. What I did:

headers = {
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br, zstd",
    "accept-language": "en-US,en;q=0.9,de;q=0.8",
    "cache-control": "max-age=0",
    "content-type": "application/json;charset=UTF-8",
    "origin": "https://stadtundland.de",
    "referer": "https://stadtundland.de/",
    "sec-ch-ua": "\"Chromium\";v=\"130\", \"Google Chrome\";v=\"130\", \"Not?A_Brand\";v=\"99\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"macOS\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "cross-site",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers)

This returns Response 500.

Another thing I tried:

from curl_cffi import requests as cureq
cureq.get(url)

results in a 500 response as well. Where applicable, I tried all of the above with the POST method too, which changed nothing. Trying

cureq.get(url, headers=headers)

also gives response 500.
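One thing I haven't fully explored is curl_cffi's impersonate option, which makes the request mimic a real browser's TLS/HTTP2 fingerprint — something plain urllib/requests cannot do. A sketch (fetch_listing is just a name I made up, and I haven't verified this against this particular site):

```python
def fetch_listing(url):
    # Imported inside the function only so this snippet loads even where
    # curl_cffi isn't installed.
    from curl_cffi import requests as cureq

    # impersonate="chrome" sends Chrome's TLS/HTTP2 fingerprint, which
    # some anti-bot setups check before the HTTP layer is ever reached.
    resp = cureq.get(url, impersonate="chrome")
    resp.raise_for_status()
    return resp.text
```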

That's where my ideas and my scarce knowledge of this topic end, sadly, so I'm grateful for any help that is offered.

EDIT: I just tried this with Selenium as a last resort. It worked (code below), but I would really like to make this part work without Selenium if possible, so the question still stands.

The code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(url)

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")

EDIT 2: Simplified the Selenium code a bit.

Upvotes: 0

Views: 61

Answers (0)
