TiRoX

Reputation: 31

Scraped Source code is incomplete - Loading Error

Using requests and urllib3 I grabbed the "incomplete" source code of https://www.immowelt.de/liste/berlin/ladenflaechen . The source code is incomplete because it only contains 4 listed items instead of 20. Looking at the resulting source, we find the following hint that this is a loading / pagination problem (line number 2191). The full source code I managed to get can be inspected here: https://pastebin.com/FgTd5Z2Y

<div class="error alert js-ErrorGeneric t_center padding_top_30" id="js-ui-items_loading_error" style="display: none;">
                        Unbekannter Fehler, bitte laden Sie die Seite neu oder versuchen Sie es später erneut.
</div>

Translating the Error text: Unknown error, please reload the page or try again later.

After that error, the source code for navigating to the next page follows. Sadly, there is a gap of 16 items between page 1 and page 2.

I looked deeper into the requests and urllib3 libraries for anything that might help. I also tried a streaming request instead of a simple "get", but sadly it didn't help in any way.

import requests
import urllib3
from bs4 import BeautifulSoup

# using requests
url = "https://www.immowelt.de/liste/berlin/ladenflaechen"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")

# using urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'https://www.immowelt.de/liste/berlin/ladenflaechen')
rip = r.data.decode('utf-8')

I expected to get all items on the page, yet only got the first 4. The source code seems to show that a simple request will not load the entire page the way a browser does.
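The shortfall can be confirmed offline by parsing a saved copy of the fetched HTML and counting the result containers. This is a minimal sketch; the `.listitem` class and the sample markup are assumptions about the page's structure, so substitute the real container class if it differs:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the saved page source; the real
# HTML from the GET request would be used here instead.
sample_html = """
<div class="listitem">item 1</div>
<div class="listitem">item 2</div>
<div class="listitem">item 3</div>
<div class="listitem">item 4</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Only the items present in the static HTML are counted; items loaded
# later by the browser never appear in this parse.
print(len(soup.select(".listitem")))  # prints 4
```

A count below the 20 items visible in a browser confirms that the missing items arrive after the initial page load.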

Upvotes: 1

Views: 179

Answers (1)

QHarr

Reputation: 84475

The page does a POST request for more results. You can do an initial request to get the total result count, then a follow-up POST to fetch all results. Note that I prefer the requests library here, and a Session object gives us the efficiency of re-using the connection.

import requests, re
from bs4 import BeautifulSoup as bs

p = re.compile(r'search_results":(.*?),')

with requests.Session() as s:
    # initial GET to read the total result count embedded in the page
    r = s.get('https://www.immowelt.de/liste/berlin/ladenflaechen')
    num_results = p.findall(r.text)[0]
    # follow-up POST requesting all items in a single page
    body = {'query': 'geoid=108110&etype=5', 'offset': 0, 'pageSize': num_results}
    r = s.post('https://www.immowelt.de/liste/getlistitems', data=body)
    soup = bs(r.content, 'lxml')
    print(len(soup.select('.listitem')))
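The regex that pulls the total result count assumes the page embeds something like `"search_results":179,` in an inline script. A quick offline check of that pattern on a hypothetical snippet (the key and value below are assumptions, not taken from the live page):

```python
import re

p = re.compile(r'search_results":(.*?),')

# Hypothetical inline-JSON fragment mimicking what the live page embeds.
snippet = '{"search_results":179,"page":1}'

num_results = p.findall(snippet)[0]
print(num_results)  # prints 179
```

Note that `findall` returns strings, so the count is passed to the POST body as a string; cast it with `int()` if you need to do arithmetic with it.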

Upvotes: 1
