MarcelKlockman

Reputation: 117

get last page number - web scraping

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.

Here is an example starting page.

There are 29 sub pages within that leading page, ideally the function would therefore return 29.

By subpage I mean page 1 of 29, page 2 of 29, and so on.

This is the HTML snippet which contains the last page information, from the link posted above.

<div id="paging-wrapper-btm" class="paging-wrapper">
        <ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>2</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=2&pgesize=36&sort=-1'>3</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=3&pgesize=36&sort=-1'>4</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=4&pgesize=36&sort=-1'>5</a></li><li #LIVALUES#>...</li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=28&pgesize=36&sort=-1'>29</a></li><li class="page-skip"><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>    

I have the following code, which finds all ol tags, but I can't figure out how to access the contents of each 'a' inside them.

a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
# < further processing >

Any help/suggestions much appreciated.

Upvotes: 0

Views: 2819

Answers (3)

Martin Evans

Reputation: 46759

The following would extract the last page number:

from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "html.parser")

ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
# The last <li> is the "Weiter »" (next) link, so the second-to-last
# entry holds the highest page number.
last_page = pages[-2]

print(last_page)

Which for your website will display:

30
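If you want this as the function described in the question, the same logic can be wrapped up roughly like so. This is only a sketch: get_last_page is an illustrative name, and it assumes the page-nos list (with its trailing "Weiter »" entry) is always present on the page.

from bs4 import BeautifulSoup
import requests

def get_last_page(url):
    # Hypothetical helper, not part of the original answer
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    ol = soup.find('ol', class_='page-nos')
    # The last <li> is the "Weiter »" link, so the second-to-last holds the page count
    return int(ol.find_all('li')[-2].text)

print(get_last_page("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1"))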

Upvotes: 0

MarcelKlockman

Reputation: 117

Ah.. I found a simple solution.

for item in soup.select("ol a"):
    x = item.text
    print(x)

I can then sort and select the largest number.
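For completeness, a minimal sketch of that sort-and-pick-largest step, assuming the numeric page labels are the only digit-only link texts inside the ol:

# Keep only the purely numeric link texts (drops "Weiter »") and take the largest
page_numbers = [int(item.text) for item in soup.select("ol a") if item.text.isdigit()]
print(max(page_numbers))  # 29 for the snippet shown in the question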

Upvotes: 2

trans1st0r

Reputation: 2073

Try this:

ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as:  # expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print(all_as)
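To get from all_as to the page count, the same digit filter as above works here too (again only a sketch, assuming the numeric page labels are the only digit-only link texts):

page_numbers = [int(a.text) for a in all_as if a.text.isdigit()]
last_page = max(page_numbers)
print(last_page)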

Upvotes: 0
