user8496595

Reputation: 19

Parsing different pages with urlparse

I am trying to parse multiple pages of a website, but I can't work out how to change the query part of the URL (if that makes sense).

I tried to create a next_page that took the first page and added 1 every time it found the next-page element, but I don't think that works because I'll have multiple start URLs (all similar). When I try to get the information from the next-page element, it returns this:

["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]

Using urlparse(response.url).query I get:

'networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword='

All I need to do is create a new link that keeps the same scheme and path but changes the query.

If you need more info, please tell me. I don't really know what is most relevant to you, as I am still a beginner.

from urllib.parse import urlparse, urljoin

urlparse(response.url)
>>> ParseResult(scheme='https', netloc='www.wcaworld.com', path='/Directory', params='', query='networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=', fragment='')

response.css('a.loadmore::attr(onmouseover)').extract()
>>>["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]

Upvotes: 0

Views: 242

Answers (1)

abdusco

Reputation: 11101

You need the base URL that the <a> element's query string is relative to, which is the part of a URL before the query string starts. For example, in https://example.com/a/path/?query=param the base URL is https://example.com/a/path/. Save that in a variable. Then use urllib.parse.parse_qsl to parse the query string, update the page number, and join the result with the base URL.
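Before any of that, the query string has to be pulled out of the onmouseover value shown in the question. Here is a minimal sketch of one way to do it, assuming the attribute always has the form loadmoreresult('?...'); return false;. The helper name build_next_url is made up for illustration, and the URLs are abbreviated versions of the ones in the question.

```python
import re
from urllib.parse import urljoin


def build_next_url(page_url, onmouseover):
    """Hypothetical helper: return the next-page URL, or None if no match."""
    # Assumption: the attribute looks like "loadmoreresult('?a=1&b=2'); return false;"
    match = re.search(r"loadmoreresult\('([^']*)'\)", onmouseover)
    if match is None:
        return None  # no next-page element, or an unexpected format
    # A relative reference starting with '?' keeps the scheme, host and path
    # of page_url and replaces only its query string.
    return urljoin(page_url, match.group(1))


page_url = 'https://www.wcaworld.com/Directory?networkId=24&pageNumber=1&pageSize=100'
attr = "loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&lastCid=116490'); return false;"
print(build_next_url(page_url, attr))
```

Since the attribute already carries the query for the next page, joining it onto the current URL may be all you need, without incrementing anything yourself.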

from urllib.parse import parse_qsl, urljoin, urlencode

BASE_URL = 'https://example.com/a/path/'
# you can also extract the base url from the scrapy.Response object,
# e.g. by dropping everything from the first '?' onwards:
# BASE_URL = response.url.split('?', 1)[0]

if __name__ == '__main__':
    # query string extracted from the next-page element
    q = 'networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'
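    # note: dict() keeps only the last value of each repeated key (networkIds, licenseIds),
    # and parse_qsl drops blank values (city, keyword) unless keep_blank_values=True
    parsed = dict(parse_qsl(q))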
    next_page = int(parsed['pageNumber']) + 1
    parsed['pageNumber'] = next_page

    next_page_url = urljoin(BASE_URL, '?' + urlencode(parsed))

    print(next_page_url)

output:

https://example.com/a/path/?networkId=24&pageNumber=3&pageSize=100&allnet=yes&networkIds=38&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&lastCid=116490
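Note that going through dict() keeps only the last networkIds and licenseIds value, and the empty city and keyword parameters disappear. If the site needs all of them, a variant that works on the list of pairs might look like the sketch below. The function name with_page_number is hypothetical, and the URL is a shortened stand-in for the real one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_number(url, page):
    """Hypothetical helper: return url with its pageNumber parameter set to page."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "city=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Work on the list of pairs so repeated keys (networkIds) all survive.
    pairs = [(k, str(page) if k == 'pageNumber' else v) for k, v in pairs]
    # Rebuild the URL with the same scheme, host and path and the new query.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


url = 'https://example.com/a/path/?networkId=24&pageNumber=2&networkIds=1&networkIds=2&city=&keyword='
print(with_page_number(url, 3))
```

This keeps the parameter order and every repeated key, at the cost of a slightly longer function than the dict version.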

Upvotes: 1
