Reputation: 19
I am new to web scraping and Python in general, and I am stuck on how to correct my function. My task is to scrape a site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; my code so far is below.
import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line

webscraping("https://en.wikipedia.org/wiki/Web_scraping")
Upvotes: 0
Views: 71
Reputation: 1
Never use regex to parse HTML. You can use Beautiful Soup instead; here is an example:
import urllib
from BeautifulSoup import *

todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)

while len(todo) > 0 :
    print "====== Todo list count is ",len(todo)
    url = todo.pop()

    if ( not url.startswith('http') ) :
        print "Skipping", url
        continue

    if ( url.find('facebook') > 0 ) :
        continue

    if ( url in visited ) :
        print "Visited", url
        continue

    print "===== Retrieving ", url
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    visited.append(url)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        newurl = tag.get('href', None)
        if ( newurl != None ) :
            todo.append(newurl)
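The todo/visited bookkeeping above can also be kept separate from the network, which makes it easier to try out. A rough Python 3 sketch, assuming only the standard library: `LinkCollector`, `extract_links` and `crawl` are hypothetical names, `fetch` is whatever function you supply to return the HTML for a URL, and (unlike the loop above) relative links are resolved with `urljoin` rather than skipped:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all anchors found in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def crawl(start_url, fetch, limit=10):
    """Visit up to `limit` pages (last-in, first-out); return URLs in visit order."""
    todo, visited = [start_url], []
    seen = set()
    while todo and len(visited) < limit:
        url = todo.pop()
        # Skip non-HTTP links (mailto:, javascript:, ...) and pages already seen
        if url in seen or not url.startswith("http"):
            continue
        seen.add(url)
        visited.append(url)
        todo.extend(extract_links(fetch(url), url))
    return visited


# Offline demo: a dict stands in for the web (made-up pages).
pages = {
    "http://example.com/": '<a href="/a">A</a> <a href="mailto:x@example.com">mail</a>',
    "http://example.com/a": '<a href="/">home</a>',
}
print(crawl("http://example.com/", lambda u: pages.get(u, "")))
```

To run it against real pages, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`.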
Upvotes: 0
Reputation: 11645
Going to go ahead and say this:
and return a list of the ones that match, preferably using regex.
No. You absolutely shouldn't use regex to parse HTML. That's exactly why we have HTML parsers.
Use BeautifulSoup; it has everything built in, and it's relatively easy to do something like this (not tested):
import urllib
from bs4 import BeautifulSoup

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    soup = BeautifulSoup(fhand, "html.parser")
    # Return every text node that starts with 'h'
    return soup.find_all(text=lambda x: x.startswith('h'))
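If what you actually need is a list of words (rather than whole text nodes) starting with a given letter, one option is to let a parser strip the markup first and only then run a regex over the plain text. A minimal Python 3 sketch, assuming only the standard library is available: `TextExtractor` and `words_starting_with` are hypothetical names, and the stdlib `HTMLParser` stands in for BeautifulSoup here:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return the words in the page text that start with `letter` (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # The regex runs on extracted text, not on the HTML itself
    pattern = r"\b" + re.escape(letter) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


# Made-up sample page so the example runs without a network connection.
sample = (
    "<html><body><h1>Hello</h1><p>his house is <b>big</b></p>"
    "<script>var hidden = 1;</script></body></html>"
)
print(words_starting_with(sample, "h"))  # ['Hello', 'his', 'house']
```

For a live page, the HTML string could come from `urllib.request.urlopen(website).read().decode("utf-8", "replace")` in Python 3.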
Upvotes: 1