Reputation: 35
I'm trying to write a script that takes a lyrics.wikia.com URL and pulls the lyrics off the page. I think I've figured out how to isolate the relevant div tag, but for some reason Python prints it with a "b'" in front of the div tag, and I don't know how to extract the lyrics from between the script elements inside the div. My code is as follows:
from bs4 import BeautifulSoup
import requests
#gets webpage
r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')
string = r.content
soup = BeautifulSoup(string[3:])
results = soup.find('div', {'class': 'lyricbox'}).encode('utf-8')
print(results)
EDIT: My end goal is still to print the lyrics, and only the lyrics, from the webpage as a string, so I need to convert the bytes object into a string and somehow remove the comment at the end. I tried removing the .encode('utf-8') from Vincent's suggested code below, and it works, but it raises an error when it reaches the comments at the end.
Upvotes: 0
Views: 895
Reputation: 1710
If you only need the lyric texts, I would suggest using pyquery instead of BeautifulSoup because I find the former simpler to use in many cases. (There are cases where BS excels, but this isn't necessarily one of them.)
import requests
from pyquery import PyQuery as pq
r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')
# You could also use r.content but it does not affect the result
doc = pq(r.text)
# Remove the script element; the HTML comment is ignored using .text()
print(doc('div.lyricbox').remove('script').text())
Update: I just noticed this was tagged Python3, and I don't have a box with it for testing at this time but I would assume it should work as is (I changed print() on the last line).
Upvotes: 0
Reputation: 2104
The b prefix is explained in the Python 2 documentation on string literals (https://docs.python.org/2/reference/lexical_analysis.html#string-literals):
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
Or, for Python 3 (https://docs.python.org/3.3/reference/lexical_analysis.html#string-literals):
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
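To make the quoted behaviour concrete, here is a minimal Python 3 sketch. The sample markup is a made-up stand-in, not the real page. It shows that .encode('utf-8') turns a str into bytes, whose printed form carries the b prefix, and that .decode('utf-8') turns it back into a str.

```python
# Hypothetical sample text standing in for the div returned by BeautifulSoup.
text = "<div class='lyricbox'>caf\u00e9</div>"

# Encoding produces a bytes object; printing it shows the b prefix
# and escapes for any byte with a value of 128 or greater.
encoded = text.encode("utf-8")
print(type(encoded).__name__)
print(encoded)

# Decoding with the same codec recovers the original str.
decoded = encoded.decode("utf-8")
print(type(decoded).__name__)
print(decoded)
```

So in Python 3 the b in the output is not part of the page content; it is only how a bytes object is displayed.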
Using either Python 2 or Python 3, this prints out the whole lyrics.
from __future__ import print_function
from bs4 import BeautifulSoup
import requests
#gets webpage
r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')
soup = BeautifulSoup(r.text)
for child in soup.select('div.lyricbox')[0].children:
    # Bare text nodes have no tag name; tags such as <script> and <br> are skipped
    if child.name is None:
        print(child.encode('utf-8'))
Note: There are still some comments on the end.
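One possible way to drop those trailing comments, sketched for Python 3: bs4 represents HTML comments with its Comment class, which is a subclass of NavigableString, so they can be filtered out with isinstance. The SAMPLE_HTML string and the extract_lyrics helper below are hypothetical illustrations, not the real page markup, and the built-in 'html.parser' is an assumed parser choice.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical stand-in for the lyricbox markup (not the real page).
SAMPLE_HTML = (
    '<div class="lyricbox"><script>var ad = 1;</script>'
    'First line<br/>Second line<br/>'
    '<!-- trailing comment --><script>var ad = 2;</script></div>'
)


def extract_lyrics(html):
    """Return the text nodes directly inside div.lyricbox as one str."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", {"class": "lyricbox"})
    lines = []
    for child in box.children:
        # Keep plain text nodes only; Comment is a NavigableString
        # subclass, so it has to be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:
                lines.append(text)
    return "\n".join(lines)


lyrics = extract_lyrics(SAMPLE_HTML)
print(lyrics)
```

Because nothing is encoded here, the result stays a str and prints without the b prefix. The same function could be fed r.text from the request above, assuming the page still uses a div with class lyricbox.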
Upvotes: 1