Removing '#' from the scraped links

Hi, I am a beginner at web scraping. I am trying to scrape all the links from a website, and I have succeeded to some extent.

import requests
from bs4 import BeautifulSoup

url = 'https://www.marian.ac.in/'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

soup.title
soup.title.string

for link in soup.find_all('a',href=True):
    print(link['href'])

The issue I am facing is that the output contains '#' entries. How can I remove them?

Can anyone help with this?

Upvotes: 0

Views: 248

Answers (2)

Roy

Reputation: 344

The # entries in your output come from anchor tags whose href attribute is just "#". The attached screenshot from the website shows this. We can filter them out by adding an if condition inside the for loop, like this:

for link in soup.find_all('a', href=True):
    # Skip anchors whose href is exactly "#" (ignoring surrounding whitespace)
    if link['href'].strip() != "#":
        print(link['href'])

This will still return a few non-URL entries such as "javascript:void(0);" or "semester-register-login". If we don't want those entries either, we need to modify the condition.
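One way to modify the condition is sketched below, using only the standard library's urllib.parse. The helper name clean_href and the BASE_URL constant are my own for illustration and are not part of requests or BeautifulSoup. It assumes you want to keep relative paths such as "semester-register-login" by resolving them against the page URL; if you would rather drop them, skip the urljoin step.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://www.marian.ac.in/'  # the page scraped in the question


def clean_href(href, base_url=BASE_URL):
    """Return an absolute http(s) URL, or None if href is not a real link.

    clean_href is a hypothetical helper, not a library function.
    """
    href = href.strip()
    # Drop empty values and in-page anchors such as "#" or "#top".
    if not href or href.startswith('#'):
        return None
    # Resolve relative paths against the page URL (assumption: we want them).
    absolute = urljoin(base_url, href)
    # Keep only web links; this rejects "javascript:", "mailto:", "tel:", etc.
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute


# Inside the scraping loop this could be used as:
#     url = clean_href(link['href'])
#     if url:
#         print(url)
```

Returning None for rejected values keeps the loop body to a single check, and the relative entries come out as full URLs that can be requested directly.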

[Screenshot of the website's HTML showing the href="#" entries]

Upvotes: 1

SIM

Reputation: 22440

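Try the following to get the links that do not start with #. You can use either of the two conditions, depending on your requirement: the first skips only links that start with #, while the commented-out one keeps only links that start with http.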

import requests
from bs4 import BeautifulSoup

url = 'https://www.marian.ac.in/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a', href=True):
    # Skip in-page anchors such as "#"
    if link['href'].strip().startswith("#"):
        continue
    # Alternative: keep only absolute links
    # if not link['href'].startswith("http"):
    #     continue
    print(link['href'])
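To see how the two conditions differ, here is a small offline sketch that applies each one to a hand-written list of href values. The list is illustrative only: "#", "javascript:void(0);" and "semester-register-login" are taken from the answers here, "#top" is made up, and the variable names are my own.

```python
# Sample href values for illustration; not the actual scrape output.
hrefs = [
    '#',
    '#top',
    'javascript:void(0);',
    'semester-register-login',
    'https://www.marian.ac.in/',
]

# Condition 1: drop only the entries that start with "#".
not_fragment = [h for h in hrefs if not h.strip().startswith('#')]

# Condition 2: keep only the entries that start with "http".
absolute_only = [h for h in hrefs if h.strip().startswith('http')]

print(not_fragment)
print(absolute_only)
```

The first condition still lets "javascript:void(0);" and relative paths through, while the second is stricter and also discards relative links that may point to real pages.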

Upvotes: 3
