Marcon Fikingas

Reputation: 13

Webscraping with Python 3

After scraping a website, I have retrieved all of its HTML links. I put them into a set() to remove duplicates, but I am still getting certain unwanted values. How do I remove '#', '#content', '#uscb-nav-skip-header', '/', and None from the set of links?

from bs4 import BeautifulSoup
import urllib.request
import re

#Gets the html code for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()

#Creates a beautifulsoup object to run
soup = BeautifulSoup(r, 'html.parser')

#Set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2

{'#',
 '#content',
 '#uscb-nav-skip-header',
 '/',
 '/data/tables/time-series/demo/popest/pre-1980-county.html',
 '/data/tables/time-series/demo/popest/pre-1980-national.html',
 '/data/tables/time-series/demo/popest/pre-1980-state.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p25-1139.html',
 '/library/publications/2015/demo/p25-1142.html',
 '/programs-surveys/popest/data.html',
 '/programs-surveys/popest/data/tables.html',
 '/programs-surveys/popest/geographies.html',
 '/programs-surveys/popest/guidance-geographies.html',
 None,
 'https://twitter.com/uscensusbureau',
 ...}

Upvotes: 1

Views: 116

Answers (5)

QHarr

Reputation: 84465

You could examine the HTML and use :not (bs4 4.7.1+) to exclude anchors based on their class attributes, then apply a final test on the href length to drop single-character values such as '#' and '/':

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = bs(r.content, 'lxml')
links = [i['href'] for i in soup.select('a[href]:not([class*="-nav-"],[class*="-pagination-"])') if len(i['href']) > 1]
print(links)

Upvotes: 0

paltaa

Reputation: 3244

You can use a set comprehension. The `if` filter goes after the `for` clause, and checking `link` first skips None:

new_set = {link for link in lst2 if link and '#' not in link}

Upvotes: 0

kaya3

Reputation: 51034

The character # (and everything after it) in a URL is relevant to a browser, but not to the server when making a web request, so it is fine to cut those parts out of URLs. This will leave URLs like '#content' blank, but also change '/about#contact' into just '/about', which is actually what you want. From there, we just need an if statement to add only the non-empty strings to the set. As long as a missing href is treated as an empty string, this filters out None at the same time:

lst2 = set()
for link in soup.find_all('a'):
    # href may be missing (None), so fall back to an empty string before splitting
    url = link.get('href') or ''
    url = url.split('#')[0]
    if url:
        lst2.add(url)

If you specifically want to exclude '/' (although it is a valid URL), you can simply write lst2.discard('/') at the end. Since lst2 is a set, this will remove it if it's there, or do nothing if it isn't.
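As a rough sketch of the same idea without the scraping step, the standard library's urllib.parse.urldefrag can do the fragment stripping. The hrefs list here is made-up sample data standing in for the values returned by link.get('href'); it is not taken from the real page.

```python
from urllib.parse import urldefrag

# Hypothetical sample of href values, including a missing one (None)
hrefs = ['#', '#content', '/', None, '/about#contact', '/en.html', '/en.html']

links = set()
for href in hrefs:
    if href is None:
        continue  # anchor had no href attribute
    url = urldefrag(href).url  # drop '#' and everything after it
    if url:  # skip strings that became empty
        links.add(url)

# Optionally drop the bare root path as well
links.discard('/')
print(sorted(links))
```

Here '/about#contact' is kept as '/about', while the pure fragment links and None are dropped.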

Upvotes: 2

game0ver

Reputation: 1290

Try the following, which skips anchors that have no href (and so removes None):

set(link.get('href') for link in soup.find_all('a') if link.has_attr("href"))

Upvotes: 0

ooi18

Reputation: 142

You can loop through your set and use regex to filter each element in the set. For the None, you can simply check if the value is None or not.
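A minimal sketch of that approach, assuming the unwanted values are a bare '/' or anything beginning with '#'. The pattern is only one possible choice, and the sample set is copied from part of the question's output rather than scraped live.

```python
import re

# Sample values taken from the question's output; the real set comes from the scrape
lst2 = {'#', '#content', '#uscb-nav-skip-header', '/', None,
        '/en.html', '/programs-surveys/popest/data.html',
        'https://twitter.com/uscensusbureau'}

# Assumed pattern: a lone '/' or any string that starts with '#'
unwanted = re.compile(r'^(?:/|#.*)$')

# Check for None first, since re.match() cannot be called on None
filtered = {link for link in lst2
            if link is not None and not unwanted.match(link)}
print(sorted(filtered))
```

Adjust the pattern if other values need to be excluded.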

Upvotes: 0
