Reputation: 71
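import urllib.request
from bs4 import BeautifulSoup

def parsehttp(url):
    # Download the page and print the href of every anchor tag
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')
    for link in soup.find_all('a'):
        href = link.attrs.get("href")
        print(href)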
I would like to extract all outgoing links from a website. However, my current code returns both relative links and outgoing links, and I only want the outgoing ones. The difference is that outgoing links start with https, while relative ones do not. I would also like to get the 'title' that comes with each link.
Upvotes: 2
Views: 372
Reputation: 499
for link in soup.find_all('a'):
    href = link.attrs.get("href", "")
    # Skip anything that is not an absolute https link
    if not href.startswith("https://"):
        continue
    print(href)
Upvotes: 0
Reputation: 39840
You can use a regular expression:
import re

# Only match anchors whose href starts with http:// or https://
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    href = link.attrs.get("href")
    if href is not None:
        print(href)
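The question also asks for the 'title' of each link. Here is a rough sketch of how the same regular-expression filter could return both values. It assumes that 'title' means the title attribute of the anchor tag (the question does not say), and it falls back to the link text when that attribute is missing. The function name outgoing_links and the sample HTML are made up for illustration, and 'html.parser' is used only so the snippet runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Matches hrefs that start with http:// or https://
ABSOLUTE_URL = re.compile(r"^https?://")

def outgoing_links(html):
    """Return (href, title) pairs for anchors with an absolute http(s) href."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for link in soup.find_all("a", href=ABSOLUTE_URL):
        # Assumption: "title" is the anchor's title attribute;
        # fall back to the visible link text if it is absent.
        title = link.get("title") or link.get_text(strip=True)
        pairs.append((link["href"], title))
    return pairs

# Hypothetical sample page for demonstration
sample = """
<a href="/about">About</a>
<a href="https://example.com/docs" title="Docs">Read the docs</a>
<a href="http://example.org/">Example</a>
<a>No href</a>
"""

for href, title in outgoing_links(sample):
    print(href, title)
```

To use it with the code from the question, pass the downloaded page (the r variable) instead of sample.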
Upvotes: 2
Reputation: 71
You can check whether the first five characters of href are "https" to identify an outgoing link (this assumes href is a string, not None):
if href[0:5] == "https":
    pass  # outgoing link
else:
    pass  # relative link
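Note that a prefix check treats absolute links back to the same site as outgoing and misses plain http links. If "outgoing" is taken to mean "points to a different host" (an assumption, since the question only mentions the https prefix), a minimal sketch using only the standard library could look like this. The name is_outgoing and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def is_outgoing(href, page_url):
    """Return True if href, resolved against page_url, points to another host."""
    # Resolve relative and protocol-relative hrefs against the page URL
    target = urlparse(urljoin(page_url, href))
    if target.scheme not in ("http", "https"):
        return False  # ignore mailto:, javascript:, and similar schemes
    return target.netloc.lower() != urlparse(page_url).netloc.lower()

page = "https://example.com/index.html"
for href in ["/about", "https://example.com/page", "https://other.org/x"]:
    print(href, is_outgoing(href, page))
```

Subdomains count as different hosts here, so adjust the comparison if www.example.com and example.com should be treated as the same site.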
Upvotes: 0