Reputation: 22440
I've written a script in Python to scrape the next-page links from a webpage, and it runs fine at the moment. The only issue with this scraper is that it can't shake off duplicate links. I hope somebody can help me accomplish this. This is what I've tried:
import requests
from lxml import html
page_link = "https://yts.ag/browse-movies"
def nextpage_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            print(item.attrib["href"])

nextpage_links(page_link)
This is a partial screenshot of the output I'm getting (the same page links are printed more than once):
Upvotes: 0
Views: 72
Reputation: 44
You can use a set for this purpose:
import requests
from lxml import html
page_link = "https://yts.ag/browse-movies"
def nextpage_links(main_link):
    links = set()
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            links.add(item.attrib["href"])
    return links

print(nextpage_links(page_link))
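Note that a set doesn't preserve the order in which the links appear on the page. If you want the original order with duplicates dropped, the usual seen-set pattern works; this nextpage_links_ordered variant is just a sketch I'm adding, not part of the answer above:

import requests
from lxml import html

def nextpage_links_ordered(main_link):
    seen = set()      # remembers hrefs we've already collected
    ordered = []      # keeps the links in page order
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        href = item.attrib["href"]
        if "page" in href and href not in seen:
            seen.add(href)
            ordered.append(href)
    return ordered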
You can also use Scrapy, which filters out duplicate requests by default via its built-in dupefilter.
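For completeness, here is a minimal sketch of such a spider; the spider name and the yielded item key are illustrative, while the CSS selector is the one from your question:

import scrapy

class PaginationSpider(scrapy.Spider):
    name = "yts_pages"  # illustrative name
    start_urls = ["https://yts.ag/browse-movies"]

    def parse(self, response):
        for href in response.css('ul.tsc_pagination a::attr(href)').getall():
            if "page" in href:
                yield {"link": href}
                # Requests produced by follow() go through Scrapy's
                # scheduler, whose dupefilter drops URLs it has already
                # seen, so duplicate links are skipped automatically.
                yield response.follow(href, callback=self.parse)

Save it to a file and run it with scrapy runspider spider.py -o links.json.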
Upvotes: 1