How to remove duplicate titles while scraping them from a web page

Question

I want to remove duplicate titles from the output. I am using Beautiful Soup to scrape the titles.

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests

source = requests.get('https://itrevolution.com/book-downloads-extra-materials/')
source = source.text

soup = BeautifulSoup(source, 'lxml')

for tl in soup.find_all('img', class_='responsive-img hover-img'):
    title = set()
    title = tl.get('title')
    print('{}'.format(title))

Output from the above script:

Accelerate
Team Topologies
Accelerate
Project to Product
War and Peace and IT
A Seat at the Table
The Art of Business Value
DevOps for the Modern Enterprise
Making Work Visible
Leading the Transformation
The DevOps Handbook
The Phoenix Project
Beyond the Phoenix Project

The title Accelerate appears twice, but it should appear only once.

arnaud · Accepted Answer

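You were on the right track: using a set() is a great idea. In your code, though, title = set() is overwritten by title = tl.get('title') on every iteration, so the set never collects anything. Create the set once before the for-loop and add each title to it with set.add(). See the following: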

from bs4 import BeautifulSoup
import requests

source = requests.get('https://itrevolution.com/book-downloads-extra-materials/')
source = source.text

soup = BeautifulSoup(source, 'lxml')

# Create the set once, outside the loop, so it accumulates across iterations
titles = set()

for tl in soup.find_all('img', class_='responsive-img hover-img'):
    title = tl.get('title')
    # Adding a title that is already in the set has no effect
    titles.add(title)

print(titles)
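
One caveat: a set does not keep the order in which titles were added, so the printed order may differ from the order on the page. If order matters, a minimal sketch is to deduplicate with dict.fromkeys instead, assuming Python 3.7+ where dicts keep insertion order. The helper name unique_in_order is hypothetical, and the sample list stands in for the values returned by tl.get('title') so the snippet runs without a network request.

```python
def unique_in_order(titles):
    """Return titles with duplicates removed, keeping first-seen order."""
    # Skip None: tl.get('title') returns None when an <img> has no title attribute.
    return list(dict.fromkeys(t for t in titles if t is not None))


# Sample values taken from the question's output (shortened for illustration).
scraped = ['Accelerate', 'Team Topologies', 'Accelerate', 'Project to Product']

# Print one title per line, as in the original script.
for title in unique_in_order(scraped):
    print(title)
```

In the scraper, the list passed in would be built from the loop, for example by appending each tl.get('title') to a list instead of adding it to a set.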
