Reputation: 13
After scraping a website, I have retrieved all of its HTML links. I put them into a set() to remove duplicates, but I am still left with some unwanted values. How do I remove the values '#', '#content', '#uscb-nav-skip-header', '/', and None from the set of links?
from bs4 import BeautifulSoup
import urllib.request
import re
# Get the HTML for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
# Create a BeautifulSoup object to parse it
soup = BeautifulSoup(r, 'html.parser')
# A set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2
{'#',
'#content',
'#uscb-nav-skip-header',
'/',
'/data/tables/time-series/demo/popest/pre-1980-county.html',
'/data/tables/time-series/demo/popest/pre-1980-national.html',
'/data/tables/time-series/demo/popest/pre-1980-state.html',
'/en.html',
'/library/publications/2010/demo/p25-1138.html',
'/library/publications/2010/demo/p25-1139.html',
'/library/publications/2015/demo/p25-1142.html',
'/programs-surveys/popest/data.html',
'/programs-surveys/popest/data/tables.html',
'/programs-surveys/popest/geographies.html',
'/programs-surveys/popest/guidance-geographies.html',
None,
'https://twitter.com/uscensusbureau',
...}
Upvotes: 1
Views: 116
Reputation: 84465
You could examine the HTML and use :not (bs4 4.7.1+) to exclude anchors based on their attribute values (here, navigation and pagination classes), then apply a final test on href length to drop single-character values such as '#' and '/':
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = bs(r.content, 'lxml')
links = [i['href'] for i in soup.select('a[href]:not([class*="-nav-"],[class*="-pagination-"])') if len(i['href']) > 1]
print(links)
Upvotes: 0
Reputation: 3244
You can use a list comprehension (the `link and` check skips None, which would otherwise raise a TypeError):
new_set = [link for link in lst2 if link and '#' not in link]
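As a quick check of this idea, here is a minimal sketch run on a few values copied by hand from the question's output (not a live scrape). It uses a set comprehension rather than a list so the result stays a set, and the extra `link != '/'` test is an assumption about what the asker wants, since '/' contains no '#'.

```python
# Hand-copied sample of the values shown in the question (not a live scrape).
lst2 = {
    '#',
    '#content',
    '#uscb-nav-skip-header',
    '/',
    '/en.html',
    '/programs-surveys/popest/data.html',
    None,
    'https://twitter.com/uscensusbureau',
}

# Keep only non-empty strings that contain no '#' and are not the bare root '/'.
# `link and ...` must come first: testing `'#' in None` would raise a TypeError.
cleaned = {link for link in lst2 if link and '#' not in link and link != '/'}

print(sorted(cleaned))
```

Note that this drops any link containing '#', including ones like '/about#contact' that point at a real page.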
Upvotes: 0
Reputation: 51034
The character # (and everything after it) in a URL is relevant to a browser, but not to the server when making a web-request, so it is fine to cut those parts out of URLs. This will leave URLs like '#content' blank, but also change '/about#contact' into just '/about', which is actually what you want. From there, we just need an if statement to only add the non-empty strings to the set. This will also filter out None at the same time:
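lst2 = set()
for link in soup.find_all('a'):
    # link.get('href') is None when the tag has no href; fall back to ''
    url = link.get('href') or ''
    # Drop the fragment: '#content' -> '', '/about#contact' -> '/about'
    url = url.split('#')[0]
    if url:
        lst2.add(url)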
If you specifically want to exclude '/' (although it is a valid URL), you can simply write lst2.discard('/') at the end. Since lst2 is a set, this will remove it if it's there, or do nothing if it isn't.
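The same steps can be tried without a network request. The sketch below is only an illustration: `clean_hrefs`, `drop_root`, and the sample list are made-up names and data, and it uses the standard library's `urllib.parse.urldefrag` in place of `split('#')`, which gives the same result for inputs like these.

```python
from urllib.parse import urldefrag


def clean_hrefs(hrefs, drop_root=False):
    """Strip fragments from hrefs and drop empty or missing values.

    `clean_hrefs` and `drop_root` are illustrative names, not part of any library.
    """
    cleaned = set()
    for href in hrefs:
        if not href:  # skips None and ''
            continue
        url = urldefrag(href).url  # the part of the URL before any '#'
        if url:  # fragment-only hrefs become '' and are skipped
            cleaned.add(url)
    if drop_root:
        cleaned.discard('/')  # no error if '/' is absent
    return cleaned


# Hypothetical sample input, loosely based on the values in the question.
sample = ['#', '#content', '/', None, '/about#contact', '/en.html']
print(sorted(clean_hrefs(sample)))
print(sorted(clean_hrefs(sample, drop_root=True)))
```

With a real page, the input would be something like `(a.get('href') for a in soup.find_all('a'))`.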
Upvotes: 2
Reputation: 1290
Try with the following:
set(link.get('href') for link in soup.find_all('a') if link.has_attr('href'))
Upvotes: 0
Reputation: 142
You can loop through your set and use regex to filter each element in the set. For the None, you can simply check if the value is None or not.
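A minimal sketch of that approach, assuming the goal is to drop fragment-only links (anything starting with '#') and the bare root '/'. The pattern, the `filter_links` name, and the sample set are illustrative choices, not something given in the question.

```python
import re

# Matches an href that is only a fragment ('#', '#content', ...) or exactly '/'.
UNWANTED = re.compile(r'#.*|/')


def filter_links(links):
    """Return the links that are not None and do not fully match UNWANTED."""
    return {
        link
        for link in links
        if link is not None and not UNWANTED.fullmatch(link)
    }


# Hand-copied sample of the values shown in the question (not a live scrape).
sample = {
    '#',
    '#content',
    '#uscb-nav-skip-header',
    '/',
    None,
    '/en.html',
    'https://twitter.com/uscensusbureau',
}
print(sorted(filter_links(sample)))
```

Using `fullmatch` rather than `search` keeps paths such as '/en.html', which merely start with '/', from being removed.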
Upvotes: 0