How to scrape a link from an HTML class using Python

Question

I am attempting to grab a link from a website. It's the audio of the word being pronounced. The website is http://dictionary.reference.com/browse/would?s=t

I am using the following code to get the link, but it is coming up blank. This is weird because I can use a similar setup to pull data for a stock. The idea is to build a program that plays the sound of a word and then asks for the spelling. This is pretty much for my kids. I need to go through a list of words and collect the links in a dictionary, but I am having trouble getting the link to print out. I'm using urllib and re; code below.

import urllib
import re
words = [ "would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)
    htmltext = htmlfile.read()
    regex = '(.+?)' #puts tag together
    pattern = re.compile(regex)
    link = re.findall(pattern, htmltext)
    print "the link for the word", word, link #should print link

This is the expected output for the word "would": http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3

alecxe · Accepted Answer

You should fix your regular expression to grab everything inside the href attribute value:
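A minimal sketch of such a pattern, assuming the audio link is an anchor tag with class="speaker" (the class targeted by the a.speaker selector in this answer), that class appears before href, and that both attributes are double-quoted. The sample markup and the find_speaker_links helper are hypothetical, for illustration only; the real page may be structured differently.

```python
import re

# Hypothetical sample markup; the real page's HTML may differ.
sample_html = (
    '<a class="speaker" '
    'href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3">'
    'listen</a>'
)

# Capture whatever sits inside href="..." on an <a> tag with class="speaker".
# Assumes class comes before href and both values use double quotes.
pattern = re.compile(r'<a\s[^>]*class="speaker"[^>]*href="(.+?)"')


def find_speaker_links(html):
    """Return every captured href value found in the given HTML string."""
    return pattern.findall(html)


print(find_speaker_links(sample_html))
```

The capture group is what re.findall returns, so only the URL comes back rather than the whole tag.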



Note that you should really consider switching from regex to HTML parsers, like BeautifulSoup.

Here is how you can apply BeautifulSoup in this case:

import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead

from bs4 import BeautifulSoup

words = ["would", "your", "apple", "orange"]

for word in words:
    url = "http://dictionary.reference.com/browse/" + word + "?s=t"  # page for this word
    htmlfile = urllib.urlopen(url)

    # Parse the page and collect the href of every <a class="speaker"> element
    soup = BeautifulSoup(htmlfile, "html.parser")
    links = [link["href"] for link in soup.select("a.speaker")]

    print(word, links)
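If installing BeautifulSoup is not an option, the same idea can probably be sketched with the HTMLParser class from the Python 3 standard library. This swaps in a different parser from the one used above; SpeakerLinkParser, extract_speaker_links, and the sample markup are hypothetical names and data for illustration, and the sketch runs against a fixed string instead of the live page.

```python
from html.parser import HTMLParser  # Python 3 standard library


class SpeakerLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list includes 'speaker'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if "speaker" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_speaker_links(html):
    """Parse an HTML string and return the hrefs of all a.speaker elements."""
    parser = SpeakerLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical sample markup; the real page's HTML may differ.
sample_html = (
    '<div><a class="speaker" '
    'href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3">play</a>'
    '<a class="other" href="http://example.com/">other</a></div>'
)

print(extract_speaker_links(sample_html))
```

Unlike a regex, this does not depend on attribute order or on how the tag is spaced, which is the main reason to prefer a parser here.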
