user5405648

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.

forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11

Ideally, I'd want to scrape just one entry per thread:

forums/my-first-forum/: threads/my-gap-year-uni-story.13846/

Here is my script:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select('a[href*="' + word + '"]')

main_url = "http://example.com/forum/"

forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []

for link in forumLinks: 
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])

print('Fetched ' + str(len(forums)) + ' forums')

threads = {}

for link in forums: 
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")

    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink

print('Fetched ' + str(len(threads)) + ' threads')

Upvotes: 2

Views: 90

Answers (1)

paul41
paul41

Reputation: 676

This solution assumes that what needs to be removed from the URL to check for uniqueness is always a trailing "/page-#..." segment. If that is not the case, this solution will not work.

Instead of using a list to store your URLs, you can use a set, which only keeps unique values. Before adding each URL to the set, remove the last occurrence of "page" and anything after it, provided it is in the format "/page-#", where # is any number.

forums = set()

for link in forumLinks: 
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
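As a quick sanity check, here is the same stripping logic pulled out into a small standalone function and run against the thread URLs from the question (the `normalize` name is just for illustration); all of the page variants collapse to a single set entry:

```python
def normalize(url):
    # Strip a trailing "/page-<number>..." suffix, if present,
    # using the same rfind check as above.
    position = url.rfind('/page-')
    if position > 0 and url[position + 6:position + 7].isdigit():
        url = url[:position + 1]
    return url

urls = [
    "threads/my-gap-year-uni-story.13846/",
    "threads/my-gap-year-uni-story.13846/page-9",
    "threads/my-gap-year-uni-story.13846/page-10",
    "threads/my-gap-year-uni-story.13846/page-11",
]

unique = {normalize(u) for u in urls}
print(unique)  # {'threads/my-gap-year-uni-story.13846/'}
```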

Upvotes: 1
