Reputation: 61
I have made a script that scrapes links to companies, but the page is not fully loaded. At the bottom of the page there is a button "Pokaż więcej" (meaning "show more"); clicking it loads more companies. My script scrapes only the first part. How can I scrape the whole list?
import requests
from bs4 import BeautifulSoup

url = "https://www.gpw.pl/spolki"
response = requests.get(url)
soup = BeautifulSoup(response.content)

x = soup.find_all("a")

tuple = []
for link in x:
    tuple.append(link.get("href"))

finalTuple = []
for x in tuple:
    if "spolka?isin=" in x:
        finalTuple.append(x)

print(finalTuple)
Upvotes: 0
Views: 1664
Reputation: 22440
Although furas has already provided a working solution, I thought I'd offer mine. If you try the approach shown below, you don't need to hardcode the parameters: the script should fetch all the required links, traversing every page no matter how many there are.
import requests
from bs4 import BeautifulSoup

link = 'https://www.gpw.pl/spolki'
post_url = 'https://www.gpw.pl/ajaxindex.php'

offset = 0

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    # Load the landing page once and harvest every form field the site expects
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # Tick all the country/voivodship/index checkboxes, as the browser does
    for item in list(payload):
        if item.startswith(('country', 'voivodship', 'index')):
            payload[item] = 'on'

    while True:
        r = s.post(post_url, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        if not soup.select("tr"):
            break
        for elem in soup.select("tr"):
            anchor = elem.select_one("a[href^='spolka?isin=']")
            if anchor:  # some rows (e.g. headers) carry no company link
                print(anchor['href'])
        offset += 10
        payload['offset'] = offset
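One small follow-up: the hrefs the site returns are relative. If you need full URLs for later requests, join them with the site root. A minimal sketch (the ISIN below is a hypothetical placeholder, only the shape of the href matters):

from urllib.parse import urljoin

base = 'https://www.gpw.pl/'
href = 'spolka?isin=PLEXAMPLE123'   # hypothetical href of the shape the site returns
print(urljoin(base, href))          # -> https://www.gpw.pl/spolka?isin=PLEXAMPLE123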
Upvotes: 3
Reputation: 142641
This page uses JavaScript (AJAX/XHR) to load data when you click Pokaż więcej.

Using DevTools in Firefox/Chrome (tab: Network, filter: XHR), I checked the URLs requested when pressing Pokaż więcej and found

https://www.gpw.pl/ajaxindex.php

The browser sends a POST request with the fields offset and limit to read the next batch of data. Using offset = 0 you can even read the data for the first page. Maybe with a bigger limit (and a bigger offset) you could read more data in one request.

The POST sends many other values, which I also add in the code.
import requests
from bs4 import BeautifulSoup

def get_first_page():
    url = "https://www.gpw.pl/spolki"
    response = requests.get(url)
    soup = BeautifulSoup(response.content)

    x = soup.find_all("a")

    tuple = []
    for link in x:
        tuple.append(link.get("href"))

    finalTuple = []
    for x in tuple:
        if "spolka?isin=" in x:
            finalTuple.append(x)

    print('\n'.join(finalTuple))

def get_next_data(offset):
    # POST the same fields the browser sends; offset/limit control pagination
    url = 'https://www.gpw.pl/ajaxindex.php'

    data = {
        "offset": offset,
        "limit": "10",
        "action": "GPWCompanySearch",
        "start": "ajaxSearch",
        "page": "spolki",
        "format": "html",
        "lang": "PL",
        "letter": "",
        "order": "",
        "order_type": "",
        "searchText": "",
        "index[empty]": "on",
        "index[WIG20]": "on",
        "index[mWIG40]": "on",
        "index[sWIG80]": "on",
        "index[WIG30]": "on",
        "index[WIG]": "on",
        "index[WIGdiv]": "on",
        "index[WIG-CEE]": "on",
        "index[WIG-Poland]": "on",
        "index[InvestorMS]": "on",
        "index[TBSP.Index]": "on",
        "index[CEEplus]": "on",
        "index[mWIG40TR]": "on",
        "index[NCIndex]": "on",
        "index[sWIG80TR]": "on",
        "index[WIG-banki]": "on",
        "index[WIG-budownictwo]": "on",
        "index[WIG-chemia]": "on",
        "index[WIG-energia]": "on",
        "index[WIG-ESG]": "on",
        "index[WIG-górnictwo]": "on",
        "index[WIG-informatyka]": "on",
        "index[WIG-leki]": "on",
        "index[WIG-media]": "on",
        "index[WIG-motoryzacja]": "on",
        "index[WIG-nieruchomości]": "on",
        "index[WIG-odzież]": "on",
        "index[WIG-paliwa]": "on",
        "index[WIG-spożywczy]": "on",
        "index[WIG-telekomunikacja]": "on",
        "index[WIG-Ukraine]": "on",
        "index[WIG.GAMES]": "on",
        "index[WIG.MS-BAS]": "on",
        "index[WIG.MS-FIN]": "on",
        "index[WIG.MS-PET]": "on",
        "index[WIG20TR]": "on",
        "index[WIG30TR]": "on",
        "index[WIGtech]": "on",
        "sector[510]":"510","sector[110]":"110","sector[750]":"750","sector[410]":"410","sector[310]":"310","sector[360]":"360","sector[740]":"740","sector[180]":"180","sector[220]":"220","sector[650]":"650","sector[350]":"350","sector[320]":"320","sector[610]":"610","sector[690]":"690","sector[660]":"660","sector[330]":"330","sector[820]":"820","sector[399]":"399","sector[150]":"150","sector[640]":"640","sector[540]":"540","sector[140]":"140","sector[830]":"830","sector[520]":"520","sector[210]":"210","sector[170]":"170","sector[730]":"730","sector[420]":"420","sector[185]":"185","sector[370]":"370","sector[630]":"630","sector[130]":"130","sector[620]":"620","sector[720]":"720","sector[710]":"710","sector[810]":"810","sector[430]":"430","sector[120]":"120","sector[450]":"450","sector[160]":"160","sector[530]":"530","sector[440]":"440",
        "country[POLSKA]":"on","country[AUSTRALIA]":"on","country[AUSTRIA]":"on","country[Belgia]":"on","country[BUŁGARIA]":"on","country[CYPR]":"on","country[CZECHY]":"on","country[DANIA]":"on","country[ESTONIA]":"on","country[FRANCJA]":"on","country[GLOBAL]":"on","country[GUERNSEY]":"on","country[HISZPANIA]":"on","country[HOLANDIA]":"on","country[INNY]":"on","country[IRLANDIA]":"on","country[KANADA]":"on","country[LITWA]":"on","country[LUKSEMBURG]":"on","country[NIEMCY]":"on","country[Norwegia]":"on","country[REPUBLIKA+CZESKA]":"on","country[SŁOWACJA]":"on","country[Słowenia]":"on","country[STANY+ZJEDNOCZONE]":"on","country[SZWAJCARIA]":"on","country[SZWECJA]":"on","country[UKRAINA]":"on","country[WĘGRY]":"on","country[WIELKA+BRYTANIA]":"on","country[WŁOCHY]":"on","country[JERSEY]":"on",
        "voivodship[11]":"on","voivodship[16]":"on","voivodship[5]":"on","voivodship[13]":"on","voivodship[17]":"on","voivodship[7]":"on","voivodship[2]":"on","voivodship[10]":"on","voivodship[8]":"on","voivodship[4]":"on","voivodship[15]":"on","voivodship[9]":"on","voivodship[6]":"on","voivodship[3]":"on","voivodship[12]":"on","voivodship[14]":"on"
    }

    response = requests.post(url, data=data)
    soup = BeautifulSoup(response.content)

    x = soup.find_all("a")

    tuple = []
    for link in x:
        tuple.append(link.get("href"))

    finalTuple = []
    for x in tuple:
        if "spolka?isin=" in x:
            finalTuple.append(x)

    print('\n'.join(finalTuple))

#get_first_page()  # you don't need it if you use offset=0

for offset in range(0, 10, 10):  # raise the stop value to fetch more pages
    print('---', offset, '---')
    get_next_data(offset)
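Following up on the bigger-limit idea: a minimal sketch that reuses the form-harvesting trick from the answer above and asks for 100 rows in one request. Whether the endpoint honours limit values above 10 is an assumption; compare the returned row count against the real responses.

import requests
from bs4 import BeautifulSoup

post_url = 'https://www.gpw.pl/ajaxindex.php'

with requests.Session() as s:
    r = s.get('https://www.gpw.pl/spolki')
    soup = BeautifulSoup(r.text, "html.parser")
    # Harvest every form field and tick the checkboxes, as in the answer above
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    for name in list(payload):
        if name.startswith(('country', 'voivodship', 'index')):
            payload[name] = 'on'
    payload['offset'] = 0
    payload['limit'] = '100'   # assumption: the server accepts limits above 10

    r = s.post(post_url, data=payload)
    soup = BeautifulSoup(r.text, "html.parser")
    links = [a['href'] for a in soup.select("a[href^='spolka?isin=']")]
    print(len(links), "links returned in one request")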
Upvotes: 2
Reputation: 15962
You have some options; driving the page in a real browser is the most likely to be viable. Taking a brief look at the page you linked, it's not obvious what source is providing the new data, so selenium is the favoured choice. See these answers, which solve similar problems.
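A minimal selenium sketch of that approach: click "Pokaż więcej" until it disappears, then collect the links. The XPath for the button is an assumption (inspect the live page to confirm it), and it assumes the button vanishes once all rows are loaded.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.gpw.pl/spolki")

while True:
    try:
        # Assumed selector: any element whose text contains the button label
        button = driver.find_element(By.XPATH, "//*[contains(text(), 'Pokaż więcej')]")
    except NoSuchElementException:
        break  # no more button -> everything is loaded
    driver.execute_script("arguments[0].click();", button)
    time.sleep(1)  # crude wait; WebDriverWait would be more robust

links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='spolka?isin=']")]
print("\n".join(links))
driver.quit()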
Btw, tuple = [] and finalTuple = []? C'mon! You are creating an empty list [] and assigning it to the variable tuple, which names a different data structure, shadowing that type's constructor. links and final_links would have been better, more meaningful names that neither override Python built-ins nor conflict with the structure of the data they hold.
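For instance, the question's two loops collapse into a pair of comprehensions with honest names (a sketch; the extra href guard skips anchors that have no href attribute):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.gpw.pl/spolki")
soup = BeautifulSoup(response.content, "html.parser")

# Plain, meaningful names -- nothing shadows a built-in
links = [a.get("href") for a in soup.find_all("a")]
final_links = [href for href in links if href and "spolka?isin=" in href]
print(final_links)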
Upvotes: 3