Reputation: 969
I wrote a small script to read all hrefs from a web page with Python.
But it has a problem: it doesn't pick up some links, for example href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648".
Code:
import urllib
import re
urls = ["http://something.com"]
regex='href=\"(.+?)\"'
pattern = re.compile(regex)
htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs
Can anybody help me? Thanks.
Upvotes: 1
Views: 110
Reputation: 27513
Use BeautifulSoup and requests for static websites. BeautifulSoup is a great module for web scraping, and with the code below you can easily get the value of the href attribute of every link. Hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'whatever url you want to parse'
result = requests.get(url)
soup = BeautifulSoup(result.content,'html.parser')
for a in soup.find_all('a',href=True):
    print "Found the URL:", a['href']
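If installing extra packages is not an option, a similar result can be sketched with the standard library's html.parser module instead of BeautifulSoup. The version below is a minimal Python 3 sketch, not a drop-in replacement for the code above. The names HrefCollector, extract_hrefs and sample_html are made up for illustration, and the single-quoted, upper-case HREF in the sample is only a guess at why a regex that expects href="..." might skip a link; the real page may differ.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so HREF and href
        # are treated the same, whatever quoting style the page uses.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return all link targets found in an HTML string."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Assumed sample markup; a real page would come from urlopen() or requests.
sample_html = """
<a href="index.html">Home</a>
<a HREF='pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648'>Story</a>
<a name="top">No link here</a>
"""

for href in extract_hrefs(sample_html):
    print("Found the URL:", href)
```

Because the parser looks at attributes rather than raw text, it does not depend on the quoting style or letter case used in the markup, which a pattern like href="(.+?)" does.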
Upvotes: 1