SIM

Reputation: 22440

How to prevent duplicate links from getting parsed?

I've written a script in Python to scrape the next-page links from a webpage, and it's running fine at the moment. The only issue with this scraper is that it can't weed out duplicate links. I hope somebody can help me accomplish this. Here is what I've tried:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    # every pagination link gets printed, including repeats
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            print(item.attrib["href"])

nextpage_links(page_link)

This is part of the output I'm getting (the same page links appear more than once):

[screenshot of partial output]

Upvotes: 0

Views: 72

Answers (1)

Sumit Gupta

Reputation: 44

You can use a set for this purpose, since a set cannot contain duplicate elements:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    links = set()  # a set silently discards any duplicate hrefs added to it
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            links.add(item.attrib["href"])

    return links

print(nextpage_links(page_link))
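Note that a set does not preserve the order in which the links appear on the page. If you want to keep that order while still dropping duplicates, a seen-set variant works; this is just a sketch, and the function name nextpage_links_ordered is my own:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links_ordered(main_link):
    seen = set()   # hrefs already encountered
    ordered = []   # first occurrence of each href, in page order
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        href = item.attrib["href"]
        if "page" in href and href not in seen:
            seen.add(href)
            ordered.append(href)
    return ordered

print(nextpage_links_ordered(page_link))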

You can also use Scrapy, which filters out duplicate requests by default.
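For illustration, here is a minimal sketch of what that could look like; the spider name and the parse_page callback are my own placeholders, but the deduplication itself comes from Scrapy's built-in request dupefilter:

import scrapy

class PaginationSpider(scrapy.Spider):
    name = "yts_pages"  # placeholder spider name
    start_urls = ["https://yts.ag/browse-movies"]

    def parse(self, response):
        for href in response.css('ul.tsc_pagination a::attr(href)').getall():
            if "page" in href:
                # Scrapy's scheduler drops requests for URLs it has already
                # seen (dont_filter defaults to False), so each pagination
                # page is requested at most once.
                yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)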

Upvotes: 1
