Reputation: 19
I am trying to parse multiple pages of a website, but I can't work out how to change the query part of the URL (if that makes sense).
I tried to create a next_page that took the first page and added 1 every time it found the next-page element, but I don't think that will work because I'll have multiple start URLs (all similar). When I try to get the information from the next-page element, it returns this:
["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]
Using urlparse(response.url).query I get:
'networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword='
All I need to do is create a new link that uses the same scheme, path and then changes the query.
If you need more info please tell me, I don't really know what is more relevant to you as I am still a beginner.
from urllib.parse import urlparse, urljoin
urlparse(response.url)
>>> ParseResult(scheme='https', netloc='www.wcaworld.com', path='/Directory', params='', query='networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=', fragment='')
response.css('a.loadmore::attr(onmouseover)').extract()
>>>["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]
Upvotes: 0
Views: 242
Reputation: 11101
You need to get the base url of that <a> element, which is the part of a url before the query string starts. For https://example.com/a/path/?query=param the base url would be https://example.com/a/path/. Save that into a variable. Then use urllib.parse.parse_qsl to parse the query string, update the page number, and join it with the base url.
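from urllib.parse import parse_qsl, urlencode, urljoin

BASE_URL = 'https://example.com/a/path/'
# you can also take the base url from the scrapy Response object:
# BASE_URL = response.url.split('?', 1)[0]

if __name__ == '__main__':
    # query string extracted from the "load more" link
    q = 'networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'
    # keep the pairs as a list: dict() would collapse repeated keys such as networkIds,
    # and keep_blank_values=True preserves empty parameters such as city= and keyword=
    pairs = parse_qsl(q, keep_blank_values=True)
    # bump pageNumber by one and leave every other pair untouched
    pairs = [(k, str(int(v) + 1)) if k == 'pageNumber' else (k, v) for k, v in pairs]
    next_page_url = urljoin(BASE_URL, '?' + urlencode(pairs))
    print(next_page_url)
output:
https://example.com/a/path/?networkId=24&pageNumber=3&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490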
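Since the onmouseover value in the question already contains the query string for the next page, another option is to lift it out and resolve it against response.url instead of incrementing pageNumber by hand. This is only a minimal sketch, assuming the handler always has the form loadmoreresult('?...'). The helper name next_page_from_handler and the shortened URLs are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed handler shape: loadmoreresult('<relative url>'); return false;
_LOAD_MORE_RE = re.compile(r"loadmoreresult\('([^']*)'\)")


def next_page_from_handler(page_url, handler):
    """Return the absolute next-page URL, or None if the handler does not match."""
    match = _LOAD_MORE_RE.search(handler)
    if match is None:
        return None
    # The captured text starts with '?', so urljoin keeps the scheme, host and
    # path of page_url and replaces only the query string.
    return urljoin(page_url, match.group(1))


if __name__ == "__main__":
    # Shortened, illustrative versions of the values shown in the question
    page_url = "https://www.wcaworld.com/Directory?networkId=24&pageNumber=1&pageSize=100"
    handler = "loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&lastCid=116490'); return false;"
    print(next_page_from_handler(page_url, handler))
```

Inside a Scrapy callback this could be fed with response.css('a.loadmore::attr(onmouseover)').get() and response.url, and the result passed to response.follow(...). Because it works from whichever response is being parsed, it should behave the same for every start url.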
Upvotes: 1