Reputation: 22440
I've written a Python script to collect all the links leading to the next pages. However, it only works up to a point. The highest page number is 255. When I run my script, I get the first 23 links along with the last-page link, but the links in between (24 to 254) are missing. How can I get all of them? Here is what I'm trying:
import requests
from lxml import html

page_link = "https://www.yify-torrent.org/search/1080p/"
b_link = "https://www.yify-torrent.org"

def get_links(main_link):
    links = []
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('div.pager a'):
        if item.attrib["href"] not in links:
            links.append(item.attrib["href"])
    for link in links:
        print(b_link + link)

get_links(page_link)
The HTML element that contains the next-page links:
<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>
The output I'm getting looks like this (truncated to the last five links):
https://www.yify-torrent.org/search/1080p/t-20/
https://www.yify-torrent.org/search/1080p/t-21/
https://www.yify-torrent.org/search/1080p/t-22/
https://www.yify-torrent.org/search/1080p/t-23/
https://www.yify-torrent.org/search/1080p/t-255/
Upvotes: 2
Views: 506
Reputation: 52685
The answer provided by @kaze should obviously return all 255 pages, but if you need to get the links dynamically without hardcoding the total number of pages, you might try:
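import requests
from lxml import html

r = requests.get("https://www.yify-torrent.org/search/1080p/")
tree = html.fromstring(r.content)
# The "Last" link's href looks like /search/1080p/t-255/; pull out the 255
page_number = tree.xpath("//div[@class='pager']/a[.='Last']/@href")[0].split("/")[-2].replace("t-", "")
# Pages are numbered from 1, so start the range at 1 rather than 0
for page in range(1, int(page_number) + 1):
    requests.get("https://www.yify-torrent.org/search/1080p/t-%s/" % page)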
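The page-number extraction can also be pulled into a small helper that is easy to check without hitting the site. This is only a sketch: the names last_page_number and page_urls are made up for illustration, it uses the standard library's re and urllib.parse instead of lxml, and it assumes (as the other answer's generated links do) that page 1 is also reachable at t-1/.

```python
import re
from urllib.parse import urljoin

BASE = "https://www.yify-torrent.org"  # base URL taken from the question


def last_page_number(last_href):
    """Return the page number from an href such as '/search/1080p/t-255/'."""
    match = re.search(r"/t-(\d+)/?$", last_href)
    if match is None:
        raise ValueError("no page number in %r" % last_href)
    return int(match.group(1))


def page_urls(last_href, base=BASE):
    """Build every page URL from 1 up to the page the 'Last' link points at."""
    total = last_page_number(last_href)
    # Strip the trailing 't-N/' to get the search prefix, e.g. '/search/1080p/'
    prefix = re.sub(r"t-\d+/?$", "", last_href)
    # Assumption: page 1 is served at 't-1/' as well as at the bare prefix
    return [urljoin(base, "%st-%d/" % (prefix, n)) for n in range(1, total + 1)]


# The href of the "Last" link shown in the question's pager HTML
urls = page_urls("/search/1080p/t-255/")
print(len(urls), urls[0], urls[-1])
```

In the real script you would pass in the href returned by the XPath query for the "Last" link rather than a literal string.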
Upvotes: 2
Reputation: 35
If the link structure isn't inferable, you would have to 'walk the site', but here you might as well generate the links yourself, like so:
for i in range(1, 256):
    print('https://www.yify-torrent.org/search/1080p/t-%s/' % i)
Upvotes: 0
Reputation: 39
Your script looks correct as it is. Looking at the HTML for that page, I see this:
<a href="/search/1080p/t-2/">2</a>
<a href="/search/1080p/t-3/">3</a>
<a href="/search/1080p/t-4/">4</a>
<a href="/search/1080p/t-5/">5</a>
<a href="/search/1080p/t-6/">6</a>
<a href="/search/1080p/t-7/">7</a>
<a href="/search/1080p/t-8/">8</a>
<a href="/search/1080p/t-9/">9</a>
<a href="/search/1080p/t-10/">10</a>
<a href="/search/1080p/t-11/">11</a>
<a href="/search/1080p/t-12/">12</a>
<a href="/search/1080p/t-13/">13</a>
<a href="/search/1080p/t-14/">14</a>
<a href="/search/1080p/t-15/">15</a>
<a href="/search/1080p/t-16/">16</a>
<a href="/search/1080p/t-17/">17</a>
<a href="/search/1080p/t-18/">18</a>
<a href="/search/1080p/t-19/">19</a>
<a href="/search/1080p/t-20/">20</a>
<a href="/search/1080p/t-21/">21</a>
<a href="/search/1080p/t-22/">22</a>
<a href="/search/1080p/t-23/">23</a>
<a href="/search/1080p/t-2/">Next</a>
<a href="/search/1080p/t-255/">Last</a>
It seems t-2 is a pointer to the Next page, which will contain the rest of the links.
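If you do want to follow the Next pointer from page to page, a rough sketch could look like the one below. The names PagerParser and walk_pages are hypothetical, the parser is the standard library's html.parser rather than lxml, and the fetch function is passed in so the loop can be tried against a small fake site first. It assumes the pager div contains no nested divs and that the final page has no "Next" link, which I have not verified for the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PagerParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside <div class="pager">."""

    def __init__(self):
        super().__init__()
        self.in_pager = False
        self.current_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "pager":
            self.in_pager = True
        elif tag == "a" and self.in_pager:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # The first text chunk after an <a> inside the pager is its label
        if self.current_href is not None:
            self.links.append((self.current_href, data.strip()))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_pager = False  # assumes no nested divs in the pager
        elif tag == "a":
            self.current_href = None


def walk_pages(start_url, fetch, max_pages=1000):
    """Follow 'Next' links from start_url, returning every page URL visited."""
    seen = []
    url = start_url
    # Stop on a missing Next link, a repeated URL, or the safety cap
    while url and url not in seen and len(seen) < max_pages:
        seen.append(url)
        parser = PagerParser()
        parser.feed(fetch(url))
        next_hrefs = [href for href, text in parser.links if text == "Next"]
        url = urljoin(url, next_hrefs[0]) if next_hrefs else None
    return seen


# Offline stand-in for a real site: three pages, the last has no "Next" link
FAKE_SITE = {
    "https://example.com/search/":
        '<div class="pager"><a href="/search/t-2/">2</a> '
        '<a href="/search/t-2/">Next</a></div>',
    "https://example.com/search/t-2/":
        '<div class="pager"><a href="/search/t-3/">Next</a></div>',
    "https://example.com/search/t-3/":
        '<div class="pager"><a href="/search/t-2/">Prev</a></div>',
}

pages = walk_pages("https://example.com/search/", FAKE_SITE.__getitem__)
print(pages)
```

Against the real site you could pass something like lambda u: requests.get(u).text as fetch. That means one request per page, so for 255 pages generating the links directly is much cheaper when the URL pattern is predictable.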
Upvotes: -1