Reputation: 71

How can I extract outgoing links from a website in Python?

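import urllib.request

from bs4 import BeautifulSoup


def parsehttp(url):
    # Download the page and parse it with the lxml parser.
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')

    # Prints every href, relative and absolute alike.
    for link in soup.find_all('a'):
        href = link.attrs.get("href")
        print(href)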

I would like to extract all outgoing links from a website. However, my current code returns both relative links and outgoing links, and I only want the outgoing ones. The difference is that outgoing links start with https, while relative links do not. I would also like to obtain the 'title' that comes with each link.

Upvotes: 2

Answers (3)

Agnes Kis

Reputation: 499

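for link in soup.find_all('a'):
    # Default to "" so anchors without an href do not raise an AttributeError.
    href = link.attrs.get("href", "")
    # Skip anything that is not an absolute https link.
    if not href.startswith("https://"):
        continue

    print(href)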

Upvotes: 0

Giorgos Myrianthous

Reputation: 39840

You can use a regular expression:

import re

# Only match anchors whose href starts with http:// or https://
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    href = link.attrs.get("href")
    if href is not None:
        print(href)
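
The question also asks for the title that goes with each link. Here is a minimal sketch, assuming "title" means the title attribute of the <a> tag (the visible link text is returned as well, in case that is what was meant). The function name outgoing_links and the sample HTML are made up for illustration, and the built-in html.parser is used instead of lxml so that nothing extra needs to be installed:

```python
import re

from bs4 import BeautifulSoup

# Absolute links start with http:// or https://
ABSOLUTE_URL = re.compile(r"^https?://")


def outgoing_links(html):
    """Return (href, title, text) for every <a> whose href is absolute."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", href=ABSOLUTE_URL):
        # link.get() returns None when the attribute is missing.
        results.append(
            (link["href"], link.get("title"), link.get_text(strip=True))
        )
    return results


# Hypothetical sample markup, for illustration only.
sample = """
<a href="/about">About</a>
<a href="https://example.com/docs" title="Docs">Read the docs</a>
<a href="http://example.org/">No title here</a>
"""

for href, title, text in outgoing_links(sample):
    print(href, title, text)
```

Links without a title attribute come back with None in that position, so the caller can decide whether to fall back to the link text.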

Upvotes: 2

Laiba Abid

Reputation: 71

You can check whether the first five characters of href are "https":

if href[0:5] == "https":
    pass  # outgoing link: handle it here
else:
    pass  # relative (or plain http) link
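
Note that a prefix check treats every absolute URL as outgoing, including absolute links back to the same site, and it misses protocol-relative links such as //cdn.other.org/lib.js. If "outgoing" should mean "points to a different host", one possible sketch using the standard library's urllib.parse follows. The name is_outgoing and the example URLs are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin, urlparse


def is_outgoing(href, page_url):
    """Return True if href resolves to an http(s) URL on a different host."""
    # Resolve relative and protocol-relative links against the page URL.
    absolute = urlparse(urljoin(page_url, href))
    if absolute.scheme not in ("http", "https"):
        return False  # mailto:, javascript:, and similar are not web links
    return absolute.netloc.lower() != urlparse(page_url).netloc.lower()


page = "https://example.com/blog/post"
print(is_outgoing("/about", page))                        # False
print(is_outgoing("https://example.com/contact", page))   # False
print(is_outgoing("https://other.org/page", page))        # True
print(is_outgoing("//cdn.other.org/lib.js", page))        # True
print(is_outgoing("mailto:me@example.com", page))         # False
```

This compares hosts literally, so www.example.com and example.com count as different sites; whether that is desirable depends on the site being scraped.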

Upvotes: 0
