Michał

Reputation: 61

How to scrape the non loaded content of the page?

I have made a script that scrapes links to companies, but the page is not fully loaded. At the bottom of the page there is a button "Pokaż więcej", which means "show more". Clicking it loads more companies. My script scrapes only the first part. How can I scrape the whole list?

import requests
from bs4 import BeautifulSoup

url = "https://www.gpw.pl/spolki"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
x = soup.find_all("a")

tuple = []
for link in x:
    tuple.append(link.get("href"))

finalTuple = []

for x in tuple:
    if x and "spolka?isin=" in x:  # get("href") returns None for anchors without an href
        finalTuple.append(x)

print(finalTuple)

Upvotes: 0

Views: 1664

Answers (3)

SIM

Reputation: 22440

Although furas has already provided a working solution, I thought I would add mine. If you use the approach shown below, you don't need to hard-code the parameters. The script should fetch all the required links, traversing all the pages no matter how many there are.

import requests
from bs4 import BeautifulSoup

link = 'https://www.gpw.pl/spolki'
post_url = 'https://www.gpw.pl/ajaxindex.php'
offset = 0

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # Rebuild the form payload from every named <input> on the page
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

    # Tick every country/voivodship/index checkbox so no company is filtered out
    for item in list(payload):
        if item.startswith(('country', 'voivodship', 'index')):
            payload[item] = 'on'

    while True:
        payload['offset'] = offset
        r = s.post(post_url, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        rows = soup.select("tr")
        if not rows:  # no rows returned -> we are past the last page
            break
        for elem in rows:
            target_link = elem.select_one("a[href^='spolka?isin=']")
            if target_link:  # some rows contain no company link
                print(target_link['href'])

        offset += 10

Upvotes: 3

furas

Reputation: 142641

This page uses JavaScript (AJAX/XHR) to load data when you click Pokaż więcej.

Using DevTools in Firefox/Chrome (tab: Network, filter: XHR), I checked the URLs requested when pressing Pokaż więcej and found

https://www.gpw.pl/ajaxindex.php

The browser sends a POST request with offset and limit values to read the next batch of data.

Using offset = 0 you can read even the data for the first page.

With a bigger limit (and a correspondingly bigger offset step) you could probably read more data in one request.

The POST sends many other values, which I also added in the code.

import requests
from bs4 import BeautifulSoup

def get_first_page():

    url = "https://www.gpw.pl/spolki"

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    x = soup.find_all("a")

    tuple = []
    for link in x:
        tuple.append(link.get("href"))

    finalTuple = []

    for x in tuple:
        if x and "spolka?isin=" in x:  # skip anchors without an href
            finalTuple.append(x)

    print('\n'.join(finalTuple))


def get_next_data(offset):

    url = 'https://www.gpw.pl/ajaxindex.php'

    data = {
        "offset": offset,
        "limit": "10",

        "action": "GPWCompanySearch",
        "start": "ajaxSearch",
        "page": "spolki",
        "format": "html",
        "lang": "PL",
        "letter": "",
        "order": "",
        "order_type": "",
        "searchText": "",
        "index[empty]": "on",
        "index[WIG20]":"on",
        "index[mWIG40]":"on",
        "index[sWIG80]":"on",
        "index[WIG30]":"on",
        "index[WIG]":"on",
        "index[WIGdiv]":"on",
        "index[WIG-CEE]":"on",
        "index[WIG-Poland]":"on",
        "index[InvestorMS]":"on",
        "index[TBSP.Index]":"on",
        "index[CEEplus]":"on",
        "index[mWIG40TR]":"on",
        "index[NCIndex]":"on",
        "index[sWIG80TR]":"on",
        "index[WIG-banki]":"on",
        "index[WIG-budownictwo]":"on",
        "index[WIG-chemia]":"on",
        "index[WIG-energia]":"on",
        "index[WIG-ESG]":"on",
        "index[WIG-górnictwo]":"on",
        "index[WIG-informatyka]":"on",
        "index[WIG-leki]":"on",
        "index[WIG-media]":"on",
        "index[WIG-motoryzacja]":"on",
        "index[WIG-nieruchomości]":"on",
        "index[WIG-odzież]":"on",
        "index[WIG-paliwa]":"on",
        "index[WIG-spożywczy]":"on",
        "index[WIG-telekomunikacja]":"on",
        "index[WIG-Ukraine]":"on",
        "index[WIG.GAMES]":"on",
        "index[WIG.MS-BAS]":"on",
        "index[WIG.MS-FIN]":"on",
        "index[WIG.MS-PET]":"on",
        "index[WIG20TR]":"on",
        "index[WIG30TR]":"on",
        "index[WIGtech]":"on",
        "sector[510]":"510","sector[110]":"110","sector[750]":"750","sector[410]":"410","sector[310]":"310","sector[360]":"360","sector[740]":"740","sector[180]":"180","sector[220]":"220","sector[650]":"650","sector[350]":"350","sector[320]":"320","sector[610]":"610","sector[690]":"690","sector[660]":"660","sector[330]":"330","sector[820]":"820","sector[399]":"399","sector[150]":"150","sector[640]":"640","sector[540]":"540","sector[140]":"140","sector[830]":"830","sector[520]":"520","sector[210]":"210","sector[170]":"170","sector[730]":"730","sector[420]":"420","sector[185]":"185","sector[370]":"370","sector[630]":"630","sector[130]":"130","sector[620]":"620","sector[720]":"720","sector[710]":"710","sector[810]":"810","sector[430]":"430","sector[120]":"120","sector[450]":"450","sector[160]":"160","sector[530]":"530","sector[440]":"440",
        "country[POLSKA]":"on","country[AUSTRALIA]":"on","country[AUSTRIA]":"on","country[Belgia]":"on","country[BUŁGARIA]":"on","country[CYPR]":"on","country[CZECHY]":"on","country[DANIA]":"on","country[ESTONIA]":"on","country[FRANCJA]":"on","country[GLOBAL]":"on","country[GUERNSEY]":"on","country[HISZPANIA]":"on","country[HOLANDIA]":"on","country[INNY]":"on","country[IRLANDIA]":"on","country[KANADA]":"on","country[LITWA]":"on","country[LUKSEMBURG]":"on","country[NIEMCY]":"on","country[Norwegia]":"on","country[REPUBLIKA+CZESKA]":"on","country[SŁOWACJA]":"on","country[Słowenia]":"on","country[STANY+ZJEDNOCZONE]":"on","country[SZWAJCARIA]":"on","country[SZWECJA]":"on","country[UKRAINA]":"on","country[WĘGRY]":"on","country[WIELKA+BRYTANIA]":"on","country[WŁOCHY]":"on","country[JERSEY]":"on",
        "voivodship[11]":"on","voivodship[16]":"on","voivodship[5]":"on","voivodship[13]":"on","voivodship[17]":"on","voivodship[7]":"on","voivodship[2]":"on","voivodship[10]":"on","voivodship[8]":"on","voivodship[4]":"on","voivodship[15]":"on","voivodship[9]":"on","voivodship[6]":"on","voivodship[3]":"on","voivodship[12]":"on","voivodship[14]":"on"
    }

    response = requests.post(url, data=data)
    soup = BeautifulSoup(response.content, "html.parser")

    x = soup.find_all("a")

    tuple = []
    for link in x:
        tuple.append(link.get("href"))

    finalTuple = []

    for x in tuple:
        if x and "spolka?isin=" in x:  # skip anchors without an href
            finalTuple.append(x)

    print('\n'.join(finalTuple))

#get_first_page()  # you don't need it if you use `offset=0` with the AJAX endpoint

for offset in range(0, 10, 10):  # raise the stop value (e.g. 1000) to fetch more pages
    print('---', offset, '---')
    get_next_data(offset)

Upvotes: 2

aneroid

Reputation: 15962

You have some options; #3 is the most likely to be viable:

  1. find a way to trigger the JavaScript which loads more data and get that page data with requests
  2. find what URL is actually fetching the new data (if there is one) and use that with requests
    • edit: see Furas' comment above. In your case, that may be a simpler/quicker solution. You could also avoid loading the original page in BS4 entirely if that AJAX endpoint provides all the "concerns" data
  3. or use Selenium, which automates the browser as well as user actions. It may be slower than just bs4 + requests, but it creates an actual browser session with a fully loaded web page, JavaScript executed in the browser, etc., so you don't need to figure out which JS does what.

Taking a brief look at the page you linked, it's not obvious which source is providing the new data, so Selenium is the favoured choice.
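A minimal Selenium sketch of that approach might look like the following. The button text "Pokaż więcej" comes from the question, but the locator strategy and the fixed sleep are assumptions you would need to verify against the live page:

```python
import time

def company_links(hrefs):
    """Keep only hrefs that point at a company page (spolka?isin=...)."""
    return [h for h in hrefs if h and "spolka?isin=" in h]

def scrape_all_links():
    # selenium imported here so the pure helper above works without it installed
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    try:
        driver.get("https://www.gpw.pl/spolki")
        while True:
            try:
                # Click "Pokaż więcej" until it is no longer present
                button = driver.find_element(By.PARTIAL_LINK_TEXT, "Pokaż więcej")
            except NoSuchElementException:
                break
            button.click()
            time.sleep(1)  # crude wait; WebDriverWait would be more robust
        hrefs = [a.get_attribute("href")
                 for a in driver.find_elements(By.TAG_NAME, "a")]
        return company_links(hrefs)
    finally:
        driver.quit()

if __name__ == "__main__":
    for url in scrape_all_links():
        print(url)
```

The filtering is split into `company_links` (a helper name made up for this sketch) so it can be tested without launching a browser.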

Btw, tuple = [] and finalTuple = []? C'mon! You are creating an empty list [] and assigning it to a variable named tuple, which is a different data structure, and in doing so you shadow the name/constructor of that built-in type. links and final_links would have been better, more meaningful names: they neither shadow Python built-ins nor conflict with the structure of the data they hold.
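For illustration, here is the question's snippet with those renames, the loops collapsed into list comprehensions, and the parsing pulled into a function (`extract_company_links` is just a name I made up) so it can be reused on any HTML:

```python
from bs4 import BeautifulSoup

def extract_company_links(html):
    """Return the hrefs that point at company pages (spolka?isin=...)."""
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href") for a in soup.find_all("a")]  # get() may yield None
    final_links = [h for h in links if h and "spolka?isin=" in h]
    return final_links

# Usage against the live page:
# import requests
# print(extract_company_links(requests.get("https://www.gpw.pl/spolki").text))
```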

Upvotes: 3
