Jonathan Holloway

Reputation: 434

How to scrape a link from an HTML class using Python

I am attempting to grab a link from a website: the audio file for a word's pronunciation. The page is http://dictionary.reference.com/browse/would?s=t

I am using the following code to get the link, but it comes up blank. This is odd, because a similar setup works when I pull data for a stock. The idea is to build a program that plays the sound of a word and then asks for the spelling; it is mostly for my kids. I need to go through a list of words and collect the links in a dictionary, but I am having trouble getting the link to print. I'm using urllib and re; the code is below.

import urllib
import re
words = [ "would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)
    htmltext = htmlfile.read()
    regex = '<a class="speaker" href =>(.+?)</a>' #puts tag together
    pattern = re.compile(regex)
    link = re.findall(pattern, htmltext)
    print "the link for the word", word, link #should print link

This is the expected output for the word "would": http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3

Upvotes: 1

Views: 637

Answers (1)

alecxe

Reputation: 474021

You should fix your regular expression to grab everything inside the href attribute value:

<a class="speaker" href="(.*?)"
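
To see the difference without hitting the site, here is a minimal sketch that runs both patterns against a made-up anchor tag. The `sample_html` string is an assumption standing in for the page's markup (the real page may lay out its attributes differently); the URL is the one from the question's expected output.

```python
import re

# Hypothetical markup standing in for the real page; the actual HTML may differ.
sample_html = (
    '<a class="speaker" '
    'href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3">'
    'listen</a>'
)

# Pattern from the question: it looks for the literal text ' href =>',
# which never appears in the tag, so nothing matches.
broken_links = re.findall('<a class="speaker" href =>(.+?)</a>', sample_html)

# Corrected pattern: capture everything between the quotes of the href value.
fixed_links = re.findall('<a class="speaker" href="(.*?)"', sample_html)

print(broken_links)
print(fixed_links)
```

The corrected pattern still depends on `class` coming directly before `href` with exactly this spacing and quoting, which is why it is fragile against real-world markup.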

Note that you should really consider switching from regex to HTML parsers, like BeautifulSoup.

Here is how you can apply BeautifulSoup in this case:

import urllib  # Python 2; on Python 3, use urllib.request.urlopen instead

from bs4 import BeautifulSoup

words = ["would", "your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t"  # build the page URL
    htmlfile = urllib.urlopen(urll)

    # Parse the page and collect the href of every <a class="speaker"> element
    soup = BeautifulSoup(htmlfile, "html.parser")
    links = [link["href"] for link in soup.select("a.speaker")]

    print(word, links)
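
If you want to try the parser approach offline and without installing anything, below is a rough sketch that swaps BeautifulSoup for the standard library's `html.parser` (Python 3 module name). The class name `SpeakerLinkParser` and the `sample_html` string are made up for illustration; the sample deliberately uses single quotes, a different attribute order, and an extra class, all of which would defeat the regex above.

```python
from html.parser import HTMLParser  # Python 3 standard library


class SpeakerLinkParser(HTMLParser):
    """Collects href values of <a> tags whose class list includes 'speaker'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if "speaker" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


# Hypothetical markup, not copied from the real page.
sample_html = (
    "<a href='http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3' "
    "class='speaker audio'>listen</a>"
)

parser = SpeakerLinkParser()
parser.feed(sample_html)
print(parser.links)
```

This is only meant to show why a parser copes with markup variations; for real scraping, the BeautifulSoup version above is shorter and easier to extend.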

Upvotes: 2
