Mayhem

Reputation: 19

Web scraping without Beautiful Soup

I am new to web scraping and to Python in general, and I'm stuck on how to correct my function. My task is to scrape a site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; my code so far is below.

import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line
webscraping("https://en.wikipedia.org/wiki/Web_scraping")

Upvotes: 0

Views: 71

Answers (2)

lawrence

Reputation: 1

Never use regex to parse HTML. You can use Beautiful Soup instead; here is an example that crawls the links on a page:

import urllib
from BeautifulSoup import *

todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)

while len(todo) > 0 :
   print "====== Todo list count is ",len(todo)
   url = todo.pop()

   if ( not url.startswith('http') ) : 
       print "Skipping", url
       continue

   if ( url.find('facebook') > 0 ) :
       continue

   if ( url in visited ) :
       print "Visited", url
       continue

   print "===== Retrieving ", url

   html = urllib.urlopen(url).read()
   soup = BeautifulSoup(html)
   visited.append(url)

   # Retrieve all of the anchor tags
   tags = soup('a')
   for tag in tags:
       newurl = tag.get('href', None)
       if ( newurl != None ) : 
           todo.append(newurl)

Upvotes: 0

Pythonista

Reputation: 11645

Going to go ahead and say this:

and return a list of the ones that match, preferably using regex. 

No. You absolutely shouldn't use regex to parse HTML. That's exactly the job HTML parsers exist for.

Use BeautifulSoup; it has everything built in, and it's relatively easy to do something like this (not tested):

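import urllib
from bs4 import BeautifulSoup

def webscraping(website):
   fhand = urllib.urlopen(website).read()
   soup = BeautifulSoup(fhand, "html.parser")
   # Return every text node whose content starts with 'h'
   return soup.find_all(text=lambda x: x.startswith('h'))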
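
Since the question title asks for a solution without Beautiful Soup, here is a minimal sketch of the same idea using only the standard library. It assumes Python 3 (the code above is Python 2), and the names TextCollector and words_starting_with are made up for illustration. The HTML is handled by html.parser first, and the regex is applied only to the extracted text, never to the markup itself.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped tag.
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return the words in the page text that start with `letter` (case-insensitive)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with spaces keeps neighbouring blocks from running together.
    text = " ".join(collector.chunks)
    pattern = r"\b" + re.escape(letter) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


# Hypothetical sample markup, used instead of a live download.
sample = (
    "<html><body><h1>Hello there</h1>"
    "<p>Web harvesting is <b>handy</b>.</p>"
    "<script>var hidden = 1;</script></body></html>"
)
print(words_starting_with(sample, "h"))  # ['Hello', 'harvesting', 'handy']
```

On a real page you would pass in the downloaded HTML decoded to a string. Joining the text chunks with spaces is a simplification: it can split a word that spans inline tags, so treat this as a starting point rather than a finished scraper.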

Upvotes: 1
