Reputation: 117
I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages in a paginated set of results.
Here is an example starting page.
There are 29 subpages within that starting page, so ideally the function would return 29.
By subpage I mean page 1 of 29, page 2 of 29, and so on.
This is the HTML snippet that contains the last-page information, taken from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>2</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=2&pgesize=36&sort=-1'>3</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=3&pgesize=36&sort=-1'>4</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=4&pgesize=36&sort=-1'>5</a></li><li #LIVALUES#>...</li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=28&pgesize=36&sort=-1'>29</a></li><li class="page-skip"><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>
I have the following code, which finds all ol tags, but I can't figure out how to access the contents of each 'a' inside them.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Upvotes: 0
Views: 2819
Reputation: 46759
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text)
# The paging list is the <ol class="page-nos"> element.
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
# The last <li> is the "Weiter »" (next) link, so the highest page
# number is the second-to-last entry.
last_page = pages[-2]
print last_page
Which for your website will display:
30
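For completeness, here is a minimal sketch of the same approach wrapped in a function, since the question asks for one; the helper name get_page_count and the int() conversion are my additions, and the code assumes the paging markup matches the snippet in the question:
from bs4 import BeautifulSoup
import requests

def get_page_count(url):
    # Fetch the listing page and parse its paging list.
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    ol = soup.find('ol', class_='page-nos')
    if ol is None:
        return 1  # no paging list found -> assume a single page
    # The last <li> is the "Weiter »" (next) link, so the highest
    # page number is the second-to-last entry.
    return int(ol.find_all('li')[-2].text)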
Upvotes: 0
Reputation: 117
Ah, I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print x
I can then sort and select the largest number.
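For reference, a small sketch of that sort-and-pick-largest step (the names page_numbers and last_page are mine, and non-numeric entries such as the "Weiter »" link are assumed to be skipped):
page_numbers = []
for item in soup.select("ol a"):
    text = item.text.strip()
    if text.isdigit():  # skip the "Weiter »" (next) link
        page_numbers.append(int(text))

last_page = max(page_numbers)  # largest page number, e.g. 29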
Upvotes: 2
Reputation: 2073
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all <a> tags inside each <ol>
all_as = []
for a in list_of_as:  # flatten each sublist of <a> tags into one list
    all_as.extend(a)
print all_as
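As a possible follow-up (not part of the answer above), once all_as holds the anchor tags, the highest page number could be read from their text; the variable name numbers is my own:
numbers = [int(a.text) for a in all_as if a.text.strip().isdigit()]
print(max(numbers))  # e.g. 29 for the snippet in the question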
Upvotes: 0