BeautifulSoup and Regular Expressions - extracting text from tags

Question

I'm writing a small text-scraping script in Python. It's my first bigger project, so I'm running into some problems. I'm using urllib2 and BeautifulSoup. I want to scrape the song names from one playlist. I can get either one song name, or all song names plus other strings that I don't need, but I can't manage to get only the song names. Here is my code that gets all song names plus the strings I don't need:

import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]

And code which gives me one song:

print soup.findAll('tr')[1].findAll('a')[0].contents[0]

It's not actually a loop, so I can't get more than one song, but when I try to turn it into a loop, I get the same song name about 10 times. That code:

for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]

I've been stuck for a day now and I can't get it working. I don't understand how these things work.

jkozera · Accepted Answer

for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]

1. You want to iterate over all the tr's, hence findAll('tr') instead of findAll('tr')[0].
2. Some rows don't contain a td, so tr.find('td') returns None for them. We need to skip those rows to avoid an AttributeError (try removing this line to see it).
3. As in 1, you want all the a's in the first td, hence findAll('a'). Also note that it is "for td in tr.find(...)", not "for td in soup.find(...)", because you want to search inside each tr, not the whole document (soup).
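
To try the three points above without hitting the network, here is a minimal sketch that runs the same loop against a small inline table. The SAMPLE_HTML markup and the song_names helper are made up for illustration; the real playlist page may be laid out differently. It uses find_all, the current bs4 spelling of findAll, and names the parser explicitly so the result doesn't depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the playlist page (an assumption,
# not the real page): one header row without any td, then two song rows.
SAMPLE_HTML = """
<table>
  <tr><th>Song</th><th>Link</th></tr>
  <tr><td><a href="/s/1">Song A</a></td><td><a href="/dl/1">download</a></td></tr>
  <tr><td><a href="/s/2">Song B</a></td><td><a href="/dl/2">download</a></td></tr>
</table>
"""


def song_names(html):
    """Return the text of every link found in the first td of each row."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.find_all("tr"):          # 1: every row, not just row [0]
        first_td = tr.find("td")
        if first_td is None:                 # 2: header rows have no td
            continue
        for a in first_td.find_all("a"):    # 3: search inside this row only
            names.append(a.get_text(strip=True))
    return names


print(song_names(SAMPLE_HTML))
```

Because the inner search starts from the row's first td instead of from soup, the "download" links in the second column are never visited, and each song name appears only once.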
