user5405648

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.

forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11

Ideally, I'd want to scrape just one entry per thread:

forums/my-first-forum/: threads/my-gap-year-uni-story.13846/

Here is my script:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select('a[href*="' + word + '"]')

main_url = "http://example.com/forum/"

forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []

for link in forumLinks: 
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])

print('Fetched ' + str(len(forums)) + ' forums')

threads = {}

for link in forums: 
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")

    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink

print('Fetched ' + str(len(threads)) + ' threads')

Upvotes: 2

Views: 90

Answers (1)

paul41
paul41

Reputation: 676

This solution assumes that what needs to be removed from the URL to check for uniqueness is always a trailing "/page-#..." segment. If that is not the case, this solution will not work.

Instead of using a list to store your URLs, you can use a set, which only keeps unique values. Before adding each URL to the set, remove the last occurrence of "page" and anything after it, provided it is in the format "/page-#", where # is any number.

forums = set()

for link in forumLinks: 
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
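As a quick sanity check, here is the same stripping logic pulled out into a small standalone function and run against the thread URLs from the question (the `normalize` name is just for illustration); all of the page variants collapse to a single set entry:

```python
def normalize(url):
    # Strip a trailing "/page-<number>..." suffix, if present,
    # using the same rfind check as above.
    position = url.rfind('/page-')
    if position > 0 and url[position + 6:position + 7].isdigit():
        url = url[:position + 1]
    return url

urls = [
    "threads/my-gap-year-uni-story.13846/",
    "threads/my-gap-year-uni-story.13846/page-9",
    "threads/my-gap-year-uni-story.13846/page-10",
    "threads/my-gap-year-uni-story.13846/page-11",
]

unique = {normalize(u) for u in urls}
print(unique)  # {'threads/my-gap-year-uni-story.13846/'}
```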

Upvotes: 1
