Reputation: 15
I'm writing a small text-scraping script in Python. It's my first bigger project, so I'm having some problems. I'm using urllib2 and BeautifulSoup. I want to scrape song names from one playlist. I can get either one song name, or all song names plus other strings that I don't need; I can't manage to get only the song names. My code that gets all song names plus the extra strings:
import urllib2
from bs4 import BeautifulSoup
import re
response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)
for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]
And the code that gives me one song:
print soup.findAll('tr')[1].findAll('a')[0].contents[0]
It's not actually a loop, so I can't get more than one, but when I try to turn it into a loop I get the same song name about 10 times. That code:
for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]
I've been stuck on this for a day now and can't get it working. I don't understand how these things work.
Upvotes: 0
Views: 943
Reputation: 1125058
You should be a little more specific in your search, then just loop over the table rows: grab the specific table by CSS class, loop over the tr elements except the first one using slicing, and grab all text from the first td:
table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)
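Here stripped_strings is a bs4 generator that yields each text fragment inside the tag with the surrounding whitespace removed; joining the fragments reassembles a name that is split across nested tags. A tiny illustration with made-up markup (not from the actual page):

cell = BeautifulSoup('<table><tr><td><a href="#">Song</a> Title </td></tr></table>').td
print ''.join(cell.stripped_strings)  # prints 'SongTitle'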
As an alternative to slicing off the first row, you can skip the thead by testing for it:
for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)
It would have been better all around if the page had used a proper <tbody> tag instead. :-)
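With a <tbody> present, the data rows would live in their own container and you could select them directly; a sketch of what the loop could look like under that hypothetical markup:

# hypothetical: assumes the page wrapped its data rows in a <tbody>
for row in table.tbody.find_all('tr'):
    print ''.join(row.find('td').stripped_strings)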
Upvotes: 1
Reputation: 3066
for tr in soup.findAll('tr'):              # 1
    if not tr.find('td'): continue         # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]
1. findAll('tr') instead of findAll('tr')[0].
2. Skip any tr that has no td in it; that filters out the header row.
3. "for td in tr.find...", not "for td in soup.find...", because you want to look inside each tr, not in the whole document (soup).
Upvotes: 1
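For reference, a self-contained version of this approach, reusing the setup from the question (same urllib2 fetch and URL; only the loop is from this answer):

import urllib2
from bs4 import BeautifulSoup

# fetch and parse the playlist page from the question
response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr'):              # every row in the document
    if not tr.find('td'): continue         # skip rows without a <td> (the header)
    for td in tr.find('td').findAll('a'):  # links in the first cell only
        print td.contents[0]               # the song name text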