Scrape webpage: Getting search results from a webpage

Question

I am trying to scrape a webpage using Python. The webpage URL is: https://kollainkomster.se/postnummer/

The webpage contains a search bar to which I want to submit an input, for instance "17568". The input is submitted via a button.

After the input has been submitted, I want to extract the element "snittlon" from the returned information below:

Postnummer 17568 har 550 628,00 i snittinkomst och 316 i placering.

I am unsure of how to achieve this. One thing which seems to complicate matters is that the displayed URL remains unchanged during the input submission.

This is all I've got so far:

import requests
from lxml import html

r = requests.get('https://kollainkomster.se/postnummer/',
                 headers={"user-agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:73.0) Gecko/20100101 Firefox/73.0'})

sim · Accepted Answer

As with all scraping tasks, please be aware that it might not be legal to do. Make sure to check with the site providers whether it is fine, and leave appropriate pauses between requests in your scripts so that you do not put undue load on the site.
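One way to keep the request rate modest is to wrap the request function so that consecutive calls are at least a fixed interval apart. The following is only a sketch: the helper name `throttled` and its `min_interval` parameter are hypothetical, and the interval you choose should reflect whatever the site providers consider acceptable.

```python
import time


def throttled(fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap `fetch` so that successive calls start at least `min_interval` seconds apart.

    `clock` and `sleep` are injectable only to make the helper easy to test.
    """
    last_call = [None]  # time of the previous call, None before the first one

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return fetch(*args, **kwargs)

    return wrapper


# Hypothetical usage with a requests session (not executed here):
#   post = throttled(session.post, min_interval=2.0)
#   r = post(url, headers=headers, data=form_data, timeout=10)
```

Passing `timeout=10` to the request itself is a separate concern: it only stops your script from hanging on a slow response, while the wrapper above limits how often you hit the site.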

That out of the way, below is an example of how you could get the information (you might need to use your own header configuration). Note that executing the page's JavaScript might not be the safest approach; writing your own parser for the information might be the better option:

import re

import requests
from bs4 import BeautifulSoup
from quickjs import Function

def get_value(script_code):
    # Replace the DOM assignment with `return`, then unwrap the
    # accounting.formatNumber(...) call so the raw number is returned as a string.
    f = Function(name="value_getter",
                 code=
        """
        function value_getter(){
            %s
        }""" % (re.sub(r'accounting\.formatNumber\((.*), 2, " ", ","\)',
                       r'"\g<1>"',
                       script_code.replace('document.getElementById("snittlon").innerHTML =',
                                           'return')))
    )
    return f()

headers = {"user-agent": '[redacted: your user agent]'}
session = requests.Session()
# Load the page once first so the session picks up any cookies.
session.get(r"https://kollainkomster.se/postnummer/", headers=headers)
# The search form posts back to the same URL, which is why the address does not change.
r = session.post(r"https://kollainkomster.se/postnummer/",
                 headers=headers,
                 data={"cf-name": 17568,
                       "cf-submitted": "Sök"})
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find(class_="postnrresultat resultat")
income = get_value(result.find("script").text)
post_number, _, ranking = [x.text for x in result.find_all("strong")]
print(income, post_number, ranking)

Output:

393568.4065 17568 1414
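
As noted above, writing your own parser may be safer than executing the page's JavaScript. A minimal sketch of that alternative follows. It assumes the inline script has the shape implied by the code above, i.e. an `accounting.formatNumber(<number>, 2, " ", ",")` call whose first argument is a plain numeric literal. The names `extract_income` and `format_sv` are hypothetical helpers, and the sample string is made up to match that assumed shape rather than copied from the site.

```python
import re

# Assumed shape of the inline script (inferred from the answer's code):
#   document.getElementById("snittlon").innerHTML =
#       accounting.formatNumber(393568.4065, 2, " ", ",");
NUMBER_RE = re.compile(r'accounting\.formatNumber\(\s*(-?\d+(?:\.\d+)?)\s*,')


def extract_income(script_code):
    """Return the first argument of accounting.formatNumber as a float, or None."""
    match = NUMBER_RE.search(script_code)
    return float(match.group(1)) if match else None


def format_sv(value):
    """Format with two decimals, a space as thousands separator and a decimal comma."""
    return f"{value:,.2f}".replace(",", " ").replace(".", ",")


sample = ('document.getElementById("snittlon").innerHTML = '
          'accounting.formatNumber(393568.4065, 2, " ", ",");')
income = extract_income(sample)
print(income, format_sv(income))  # 393568.4065 393 568,41
```

In the script above, `extract_income(result.find("script").text)` could then stand in for `get_value(...)`, which removes the quickjs dependency. If the site changes the script's format, the regular expression returns `None` instead of running unexpected code.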
