Reputation: 361
I am completely new to Python and am just trying out my coding skills by developing a few small programs.
I have written the following program in Python 2.7 to fetch the profile URLs from this directory: http://www.uschirodirectory.com/entire-directory/list/alpha/a.html
However, I am noticing a lot of duplicate entries in the list of URLs fetched. Could someone please review the code and tell me if there is something I am doing wrong here, or whether this code could be optimized further?
Many thanks
import requests
from bs4 import BeautifulSoup

def web_crawler(max_pages):
    p = '?site='
    page = 1
    alpha = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
             'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    while page <= max_pages:
        for i in alpha:
            url = ('http://www.uschirodirectory.com/entire-directory/list/alpha/'
                   + i + '.html' + p + str(page))
            code = requests.get(url)
            text = code.text
            soup = BeautifulSoup(text)
            for link in soup.findAll('a', {'class': 'btn'}):
                href = 'http://www.uschirodirectory.com' + link.get('href')
                print(href)
        page += 1

# Run the crawler
web_crawler(1)
Upvotes: 1
Views: 1504
Reputation: 2444
You can store the scraped records in a list and skip duplicate URLs with a check like this:

parsedData = []

data = {'url': href}  # build one record per scraped link
if not any(d['url'] == data['url'] for d in parsedData):
    parsedData.append(data)
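Note that scanning parsedData with any() is linear per insert, so it gets slow on large crawls. A minimal alternative sketch, assuming each record is a dict with a 'url' key as above, is to keep a separate set of seen URLs for constant-time lookups:

parsedData = []
seen_urls = set()

data = {'url': href}  # one scraped record, as above
if data['url'] not in seen_urls:
    seen_urls.add(data['url'])
    parsedData.append(data)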
Upvotes: 2
Reputation: 6267
Basically your code is OK. You are probably getting lots of duplicate links because the directory is designed to return results not just for the first letter of the doctor's name, but also for the first letter of the company title or other important database fields.
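If you only need each profile once, no matter how many letters it is listed under, one option is to collect the hrefs into a set inside the crawler instead of printing them as you go. A rough sketch based on the question's code (the URL pattern and the 'btn' selector come from it; returning the set and sorting it at the end are my own additions):

import requests
from bs4 import BeautifulSoup

def web_crawler(max_pages):
    # A set keeps only one copy of each profile URL, so a profile
    # listed under several letters is automatically deduplicated.
    found = set()
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    for page in range(1, max_pages + 1):
        for letter in alpha:
            url = ('http://www.uschirodirectory.com/entire-directory/'
                   'list/alpha/%s.html?site=%d' % (letter, page))
            soup = BeautifulSoup(requests.get(url).text)
            for link in soup.findAll('a', {'class': 'btn'}):
                found.add('http://www.uschirodirectory.com' + link.get('href'))
    return found

for href in sorted(web_crawler(1)):
    print(href)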
Upvotes: 2