T.Donahue

Reputation: 3

Removing duplicate URLs in Python (without a list)

I need help removing duplicate URLs from my output. If possible, I'd like to do it without putting everything in a list. I feel like it can be achieved with some logical statement, I'm just not sure how to make it happen. Using Python 3.6.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.parse import urljoin as join

my_url = 'https://www.census.gov/programs-surveys/popest.html'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

filename = "LinkScraping.csv"
f = open(filename, "w")
headers = "Web_Links\n"
f.write(headers)

links = page_soup.findAll('a')

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url:
        f.write(str(ab_url) + "\n")

f.close()

Upvotes: 0

Views: 1268

Answers (1)

DeepSpace

Reputation: 81604

You can't achieve this without using a data structure of some sort, unless you want to write to the file and re-read it over and over again (which is far less preferable than using an in-memory data structure).

Use a set:

.
.
.

urls_set = set()

for link in links:
    web_links = link.get("href")
    ab_url = join(my_url, web_links)
    print(ab_url)
    if ab_url and ab_url not in urls_set:
        f.write(str(ab_url) + "\n")
        urls_set.add(ab_url)

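For completeness, here is a minimal end-to-end sketch of how the question's script might look with the set-based de-duplication folded in (it reuses the same URL and imports from the question, and uses a `with` block so the file is closed automatically; the `seen_urls` name is just illustrative):

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    from urllib.parse import urljoin as join

    my_url = 'https://www.census.gov/programs-surveys/popest.html'

    # Fetch and parse the page
    uClient = uReq(my_url)
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    seen_urls = set()  # URLs already written; set membership checks are O(1) on average

    with open("LinkScraping.csv", "w") as f:
        f.write("Web_Links\n")
        for link in page_soup.findAll('a'):
            href = link.get("href")
            if not href:
                continue  # skip <a> tags with no href attribute
            ab_url = join(my_url, href)
            if ab_url not in seen_urls:
                f.write(ab_url + "\n")
                seen_urls.add(ab_url)

A set is the natural fit here because you only care about membership ("have I written this URL already?"), not about order or counts.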
Upvotes: 1
