Web scraping: read all href

Question

I wrote a small Python script to read all the hrefs from a web page, but it has a problem: it doesn't read href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648", for example.

code:

import urllib
import re

urls = ["http://something.com"]

regex='href=\"(.+?)\"'
pattern = re.compile(regex)

htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs

Can anybody help me? Thanks.

Exprator · Accepted Answer

Use BeautifulSoup together with requests for static websites. BeautifulSoup is a great module for web scraping; with the code below you can easily get the value of the href attribute of every link. Hope it helps.

import requests
from bs4 import BeautifulSoup

url = 'whatever url you want to parse'

result = requests.get(url)

soup = BeautifulSoup(result.content,'html.parser')

# href=True skips <a> tags that have no href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
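
To illustrate why a parser tends to be more robust than the regex in the question, here is a minimal Python 3 sketch that uses only the standard library's html.parser instead of BeautifulSoup. The question does not say why the link was missed, so the sample markup is an assumption: SAMPLE_HTML is a made-up page with a single-quoted href and an uppercase HREF, two cases the pattern href="(.+?)" cannot match. HrefCollector and extract_hrefs are hypothetical names introduced for this example only.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Tag and attribute names
        # arrive lowercased, and entities such as &amp; are unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all <a> tags in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample page (an assumption, not the asker's real markup).
SAMPLE_HTML = """
<a href="index.php">Home</a>
<a href='pages.php?ef=fa&amp;page=n_fullstory.php&amp;NewsIDn=1648'>Story</a>
<a HREF="about.php">About</a>
"""

# The regex from the question only sees lowercase, double-quoted hrefs.
regex_hrefs = re.findall(r'href="(.+?)"', SAMPLE_HTML)

# The parser handles quoting style and letter case for us.
parsed_hrefs = extract_hrefs(SAMPLE_HTML)

# Relative links can be resolved against the page URL.
base_url = "http://something.com/"
absolute_hrefs = [urljoin(base_url, href) for href in parsed_hrefs]

print("regex: ", regex_hrefs)
print("parser:", parsed_hrefs)
print("full:  ", absolute_hrefs)
```

On this sample the regex finds one link while the parser finds all three. If the missing link on the real page is added by JavaScript after the page loads, neither approach will see it, because both only read the HTML that the server returns.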
