Web scraping without Beautiful Soup

I am new to web scraping and to Python in general, and I'm stuck on how to correct my function. My task is to scrape a site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; here is my code so far.

import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line
webscraping("https://en.wikipedia.org/wiki/Web_scraping")

Upvotes: 0

Answers (2)

lawrence

Reputation: 1

Never use regex to parse HTML. You can use Beautiful Soup instead; here is an example:

import urllib
from BeautifulSoup import *

todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)

while len(todo) > 0 :
   print "====== Todo list count is ",len(todo)
   url = todo.pop()

   if ( not url.startswith('http') ) : 
       print "Skipping", url
       continue

   if ( 'facebook' in url ) :
       continue

   if ( url in visited ) :
       print "Visited", url
       continue

   print "===== Retrieving ", url

   html = urllib.urlopen(url).read()
   soup = BeautifulSoup(html)
   visited.append(url)

   # Retrieve all of the anchor tags
   tags = soup('a')
   for tag in tags:
       newurl = tag.get('href', None)
       if ( newurl is not None ) :
           todo.append(newurl)
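
Note that the loop above skips relative links such as "/wiki/Data_scraping", because they do not start with 'http'. As a minimal sketch of the link-collecting step, assuming Python 3 and only the standard library (the names LinkCollector and extract_links are made up for illustration and are not part of the answer above), relative links could be resolved against the page URL before being queued:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin leaves absolute URLs alone and resolves relative ones
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

The returned list could then be appended to todo in place of the raw href values.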

Upvotes: 0

Pythonista

Reputation: 11645

Going to go ahead and say this:

and return a list of the ones that match, preferably using regex.

No. You ~~don't~~ absolutely shouldn't use regex to parse HTML. That's exactly the job HTML parsers exist for.

Use BeautifulSoup; it has everything built in, and it's relatively easy to do something like this (not tested):

import urllib
from bs4 import BeautifulSoup

def webscraping(website):
   fhand = urllib.urlopen(website).read()
   soup = BeautifulSoup(fhand, "html.parser")
   # Return every text node that starts with 'h'
   return soup.find_all(text=lambda x: x.startswith('h'))
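
If the goal is a list of individual words rather than whole text nodes, and Beautiful Soup is not available, one option is to let a parser pull out the page text and apply the regex only to that text, never to the markup. The following is a rough sketch under those assumptions, written for Python 3 with the standard library only; TextExtractor and words_starting_with are hypothetical names chosen for this example:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return the words in the page text that start with letter (any case)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps text from adjacent tags from merging,
    # at the cost of splitting a word that spans inline tags.
    text = " ".join(parser.chunks)
    pattern = re.compile(r"\b" + re.escape(letter) + r"\w*", re.IGNORECASE)
    return pattern.findall(text)
```

To run it against a live page, the HTML string could be fetched with urllib.request.urlopen(website).read().decode("utf-8") and passed in as the first argument.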

Upvotes: 1
