MarG

Reputation: 13

Scrape webpage: Getting search results from a webpage

I am trying to scrape a webpage using Python. The webpage URL is: https://kollainkomster.se/postnummer/

The webpage contains a search bar:

<input type="text" name="cf-name" pattern="[0-9 ]+" value="" placeholder="25245" size="40">

which I want to submit an input to. Such an input could for instance be: "17568". The input is submitted via a button:

<input type="submit" name="cf-submitted" value="Sök">

After the input has been submitted I want to extract the element: snittlon, from the returned information below:

<p class="postnrresultat resultat">Postnummer <strong>17568</strong> har <strong>
<span id="snittlon">550 628,00</span></strong> i snittinkomst och <strong>316</strong> 
i placering.<script> if(isNaN("550628.0049")){document.getElementById("snittlon").innerHTML = "550628.0049"} 
else {document.getElementById("snittlon").innerHTML = accounting.formatNumber(550628.0049, 2, " ", ",");}
</script></p>

I am unsure of how to achieve this. One thing which seems to complicate matters is that the displayed URL remains unchanged during the input submission.

This is all I've got so far:

import requests
from lxml import html

r = requests.get(
    'https://kollainkomster.se/postnummer/',
    headers={"user-agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:73.0) Gecko/20100101 Firefox/73.0'},
)

Upvotes: 0

Views: 120

Answers (1)

sim

Reputation: 1257

As with all scraping tasks, please be aware that scraping a site might not be legal or permitted. Check with the site provider whether it is fine, and pause appropriately between requests in your scripts so that you do not put undue load on the site.

That out of the way, below is an example of how you could get the information (you might need to use your own header configuration). Note that executing the page's JavaScript might not be the safest approach, and writing your own parser for the value might be the better option:

import re

import requests
from bs4 import BeautifulSoup
from quickjs import Function


def get_value(script_code):
    # Make the inline script return the raw number instead of writing it
    # into the page.
    body = script_code.replace(
        'document.getElementById("snittlon").innerHTML =', 'return')
    # accounting.formatNumber does not exist outside the page, so replace the
    # call with its first argument as a string.
    body = re.sub(r'accounting\.formatNumber\((?P<num>.*), 2, " ", ","\)',
                  r'"\g<num>"', body)
    f = Function(name="value_getter",
                 code="""
        function value_getter(){
            %s
        }""" % body)
    return f()


headers = {"user-agent": '[redacted: your user agent]'}
session = requests.Session()
# Load the form page first, then submit the form fields with a POST to the
# same URL.
session.get("https://kollainkomster.se/postnummer/", headers=headers)
r = session.post("https://kollainkomster.se/postnummer/",
                 headers=headers,
                 data={"cf-name": 17568,
                       "cf-submitted": "Sök"})
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find(class_="postnrresultat resultat")
income = get_value(result.find("script").text)
# The three <strong> tags hold the postcode, the formatted income and the rank.
post_number, _, ranking = [x.text for x in result.find_all("strong")]
print(income, post_number, ranking)

Output:

393568.4065 17568 1414
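
As mentioned above, parsing the value yourself may be preferable to executing the page's JavaScript. Here is a minimal sketch of that alternative, assuming the inline script keeps the shape shown in the question, with the raw figure inside isNaN("..."). The name parse_income is a hypothetical helper of mine, not part of any library, and the pattern will need adjusting if the site changes its markup:

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumption: the raw figure appears as the quoted argument of isNaN("...")
# in the inline script, as in the HTML shown in the question.
_RAW_VALUE = re.compile(r'isNaN\("(?P<num>[^"]*)"\)')


def parse_income(script_code: str) -> Optional[Decimal]:
    """Return the raw income figure from the inline script, or None."""
    match = _RAW_VALUE.search(script_code)
    if match is None:
        return None
    try:
        value = Decimal(match.group("num"))
    except InvalidOperation:
        # The quoted text was not a number.
        return None
    # Reject special values such as NaN or Infinity.
    return value if value.is_finite() else None
```

You could then call parse_income(soup.find(class_="postnrresultat resultat").find("script").text) in place of get_value, which removes the quickjs dependency. It returns None instead of raising when the script does not look as expected, so the caller can decide how to handle a changed page.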

Upvotes: 2
