Reputation: 31
I'm new to SO and having some difficulty scraping a table from a website using BeautifulSoup.
The source HTML of the table looks something like this (repeated ad nauseam for every artist/song/album):
<td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>
And I'm trying to create an output file with all of that information. The code I'm using is:
with open('output.txt', 'w', encoding='utf-8') as f:
    for tr in soup.find_all('tr')[1:]:
        tds = tr.find_all('td')
        f.write("Information: %s" % tds[3].text)
which gets me an output like so:
Information:
song
singer | album
How do I change this to have it all on one line, and to also separate it properly? Ideally my output should be like this:
Song Title: song
Artist: singer
Album Name: album
Upvotes: 2
Views: 67
Reputation: 71471
You can use regular expressions with BeautifulSoup:
from bs4 import BeautifulSoup as soup
import re
s = """
<td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>
"""
s = soup(s, 'lxml')
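# Note: the unescaped "|" inside the pattern acts as regex alternation,
# so each of the three alternatives captures exactly one field.
data = [list(filter(None, c))[0] for c in [re.findall('title="song">(.*?)</p>|album">(.*?)<span class="bar">|</span>(.*?)</p>', str(i)) for i in s.find_all('td', {'class':'subject'})][0]]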
for i in zip(['Song', 'Artist', 'Album'], data):
    print('{}: {}'.format(*i))
Output:
Song: song
Artist: artist
Album: album
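If you would rather not run a regex over the serialized HTML, the same fields can probably be read straight from the parsed tags. This is only a sketch under a few assumptions: every `td.subject` holds the song `<p>` first and the `p.singer` second, the `title` attribute always has the "artist | album" shape shown in the question, and the `<table>` wrapper, the `parse_subject` helper and the `'html.parser'` backend are my own choices for the example, not something from the original page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the <table>/<tr> wrapper is assumed.
html = """
<table><tr><td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td></tr></table>
"""

def parse_subject(td):
    """Hypothetical helper: return (song, artist, album) for one td.subject cell."""
    song_p, singer_p = td.find_all('p')[:2]
    # The title attribute reads "artist | album", so split on the first "|".
    artist, _, album = singer_p['title'].partition('|')
    return song_p.get_text(strip=True), artist.strip(), album.strip()

soup = BeautifulSoup(html, 'html.parser')
rows = [parse_subject(td) for td in soup.find_all('td', class_='subject')]

for song, artist, album in rows:
    print('Song: {}\nArtist: {}\nAlbum: {}'.format(song, artist, album))
```

Splitting the attribute with `partition` keeps an album name intact even if it happens to contain a second "|", though an artist name containing "|" would still be cut at the wrong place.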
Upvotes: 1
Reputation: 2558
I think you are close; you just need to process the contents of tds. I would do the following:
from bs4 import BeautifulSoup
# Define the sample markup before parsing it
html = """<td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td>"""
b = BeautifulSoup(html, 'lxml')
tds = b.find_all('td')
data = tds[0]
t = data.text.split('\n')
song = t[1]
artist_album = t[2].split('|')
artist = artist_album[0]
album = artist_album[1]
print("Song:", song)
print("Artist:", artist)
print("Album:", album)
This should give you:
Song: song
Artist: artist
Album: album
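To tie this back to the loop in the question, here is a rough sketch of writing one labelled block per row to `output.txt`. Treat the details as assumptions: the sample table with a header row, the `format_row` helper, and looking the cell up by `class_='subject'` instead of `tds[3]` are mine, and I used `'html.parser'` only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Assumed sample table: one header row, then one data row as in the question.
html = """
<table>
<tr><th>Info</th></tr>
<tr><td class="subject">
<p title="song">song</p>
<p class="singer" title="artist | album">artist<span class="bar">|</span>album</p>
</td></tr>
</table>
"""

def format_row(td):
    """Hypothetical helper: build the three labelled lines for one cell."""
    paragraphs = td.find_all('p')
    song = paragraphs[0].get_text(strip=True)
    # The second <p> renders as "artist|album"; split on the first "|".
    artist, _, album = paragraphs[1].get_text().partition('|')
    return 'Song Title: {}\nArtist: {}\nAlbum Name: {}\n'.format(
        song, artist.strip(), album.strip())

soup = BeautifulSoup(html, 'html.parser')

lines = []
for tr in soup.find_all('tr')[1:]:          # skip the header row
    td = tr.find('td', class_='subject')
    if td is not None:                      # ignore rows without that cell
        lines.append(format_row(td))

with open('output.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines)
```

Looking the cell up by class rather than by position should survive a change in column order, but if the real table has several `td.subject` cells per row you would need `find_all` there instead.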
Upvotes: 2