ignassz

Reputation: 15

BeautifulSoup and Regular Expressions - extracting text from tags

I'm writing a small text-scraping script in Python. It's my first bigger project, so I'm running into problems. I'm using urllib2 and BeautifulSoup, and I want to scrape the song names from one playlist. I can get either a single song name, or all the song names plus other strings that I don't need; I can't manage to get only the song names. This code gets all the song names plus the unwanted strings:

import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]

And this code gives me one song:

print soup.findAll('tr')[1].findAll('a')[0].contents[0]

It's not actually a loop, so I can't get more than one song, but when I try to turn it into a loop, I get the same song name about 10 times. That code:

for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]

I've been stuck on this for a day and can't get it working. I don't understand how these things work.

Upvotes: 0


Answers (2)

Martijn Pieters

Reputation: 1125058

Be a little more specific in your search, then loop over the table rows: grab the table by its CSS class, loop over the tr elements except the first one (using slicing), and take all the text from the first td of each row:

table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)

As an alternative to slicing off the first row, you can skip the thead by testing for it:

for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)

It would have been better all around if the page had used a proper <tbody> tag instead. :-)
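To try the thead test without fetching the page, here is a minimal sketch that runs against an inline sample. The markup in SAMPLE_HTML is an assumption about what the playlist table looks like (only the data-table class comes from the code above), and the helper name song_names is made up for illustration. It collects the names in a list instead of printing them, so it behaves the same on Python 2 and 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the playlist page; the real page may differ.
SAMPLE_HTML = """
<table class="data-table">
  <thead><tr><th>Song</th><th>Artist</th></tr></thead>
  <tr><td><a href="#1">Song One</a></td><td>Artist A</td></tr>
  <tr><td><a href="#2">Song Two</a></td><td>Artist B</td></tr>
</table>
"""


def song_names(html):
    """Return the text of the first cell of every non-header row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data-table")
    names = []
    for row in table.find_all("tr"):
        # Rows that sit inside <thead> are headers, not songs.
        if row.parent.name == "thead":
            continue
        # Join all text fragments of the first cell, stripped of whitespace.
        names.append("".join(row.find("td").stripped_strings))
    return names


print(song_names(SAMPLE_HTML))
```

Passing "html.parser" explicitly keeps the result independent of which optional parser happens to be installed; other parsers may restructure the table (for example by inserting a tbody), so the parent check is worth re-testing if you switch.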

Upvotes: 1

jkozera

Reputation: 3066

for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]

  1. You want to iterate over all tr's, hence findAll('tr') instead of findAll('tr')[0].
  2. Some rows don't contain a td, and tr.find('td') returns None for them, so we need to skip those rows to avoid an AttributeError (try removing this line to see it).
  3. As in 1, you want all the a's in the first td. Also note "for td in tr.find", not "for td in soup.find": you want to search inside each tr, not the whole document (soup).
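Point 3 is probably what caused the repeated names in the question, and it can be seen on a small inline sample. This is only a sketch: the markup in SAMPLE_HTML is invented for illustration (the real playlist page may be laid out differently), and the variable names unscoped and scoped are mine. It uses find_all, the bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row plus two song rows, two links per song row.
SAMPLE_HTML = """
<table>
  <tr><th>Song</th><th>Download</th></tr>
  <tr><td><a href="#a">Alpha</a></td><td><a href="#dl-a">mp3</a></td></tr>
  <tr><td><a href="#b">Beta</a></td><td><a href="#dl-b">mp3</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching from soup inside the loop ignores the current row,
# so every row yields every link in the whole document.
unscoped = []
for tr in soup.find_all("tr"):
    for a in soup.find_all("a"):
        unscoped.append(a.get_text())

# Searching from the row's first td only sees that cell's links.
scoped = []
for tr in soup.find_all("tr"):
    first_td = tr.find("td")
    if first_td is None:
        # Header row: it has th cells but no td, so skip it.
        continue
    for a in first_td.find_all("a"):
        scoped.append(a.get_text())

print(len(unscoped))
print(scoped)
```

The unscoped list grows with rows times links, which is why the output in the question repeated itself, while the scoped list holds one entry per song.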

Upvotes: 1
