thenorm

Reputation: 35

BeautifulSoup Python adding extra characters

I'm writing a script that takes a lyrics.wikia.com URL and pulls the lyrics off the page. I think I've figured out how to isolate the relevant div tag, but for some reason Python prints it with a "b'" in front of the div tag, and I don't know how to extract the lyrics that sit between the script elements inside the div. My code is as follows:

from bs4 import BeautifulSoup
import requests

#gets webpage
r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')
string = r.content
soup = BeautifulSoup(string[3:])

results = soup.find('div', {'class': 'lyricbox'}).encode('utf-8')
print(results)

EDIT: My end goal is still to print the lyrics, and only the lyrics, from the webpage as a string, so I need to convert the bytes object into a string and somehow remove the comment at the end. I tried removing the .encode('utf-8') from Vincent's suggested code below; it works, but it raises an error when it reaches the comments at the end.

Upvotes: 0

Views: 895

Answers (2)

Jarno Lamberg

Reputation: 1710

If you only need the lyric texts, I would suggest using pyquery instead of BeautifulSoup because I find the former simpler to use in many cases. (There are cases where BS excels, but this isn't necessarily one of them.)

import requests
from pyquery import PyQuery as pq

r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')

# You could also use r.content but it does not affect the result
doc = pq(r.text)

# Remove the script element; the HTML comment is ignored using .text()
print(doc('div.lyricbox').remove('script').text())

Update: I just noticed this was tagged Python 3. I don't have a machine with it for testing right now, but I assume the code works as is (I changed the last line to call print() as a function).

Upvotes: 0

Vincent Beltman

Reputation: 2104

The b prefix marks a bytes literal. From the Python 2 documentation (https://docs.python.org/2/reference/lexical_analysis.html#string-literals):

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

Or, for Python 3 (https://docs.python.org/3.3/reference/lexical_analysis.html#string-literals):

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
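To make the quoted behaviour concrete, here is a minimal Python 3 sketch. The sample string is made up for illustration. It shows that .encode('utf-8') returns a bytes object, which print() displays with the b prefix, and that .decode('utf-8') turns it back into a str.

```python
# Python 3: str.encode() returns bytes, and print() shows the bytes repr
text = "Dear Mama"               # illustrative sample string, not scraped data
data = text.encode("utf-8")      # bytes object

print(type(data).__name__)       # bytes
print(data)                      # b'Dear Mama'

# Decoding with the same codec gives back an ordinary str
decoded = data.decode("utf-8")
print(decoded)                   # Dear Mama
```

So in Python 3 the b' in the output is not an extra character in the page; it is how a bytes object is displayed.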

With either Python 2 or Python 3, this prints the lyrics:

from __future__ import print_function
from bs4 import BeautifulSoup
import requests

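# Fetch the page and parse it with an explicitly chosen parser
r = requests.get('http://lyrics.wikia.com/2Pac:Dear_Mama')
soup = BeautifulSoup(r.text, 'html.parser')

# Text nodes (and comments) have no tag name; tags such as <br> and <script> do
for child in soup.select('div.lyricbox')[0].children:
    if child.name is None:
        print(child.encode('utf-8'))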

Note: The HTML comments at the end of the div are still printed.
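One possible way to drop the trailing comments and end up with a plain str in Python 3 is to skip nodes that are instances of bs4's Comment class and to leave out the .encode() call. The sketch below is untested against the real page: SAMPLE_HTML is a made-up stand-in for the downloaded markup, and extract_lyrics is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for r.text; the real page's markup may differ
SAMPLE_HTML = (
    '<div class="lyricbox"><script>var ad = 1;</script>'
    'First line<br/>Second line<br/>'
    '<!-- trailing comment --></div>'
)

def extract_lyrics(html):
    """Return the text nodes of div.lyricbox joined by newlines (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="lyricbox")
    lines = []
    for child in box.children:
        # Tags (<script>, <br>) have a name; comments are text nodes of type Comment
        if child.name is None and not isinstance(child, Comment):
            text = child.strip()
            if text:
                lines.append(text)
    return "\n".join(lines)

lyrics = extract_lyrics(SAMPLE_HTML)
print(lyrics)
```

With the real page you would pass r.text instead of SAMPLE_HTML. Because nothing is encoded, the result is a str and prints without the b prefix.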

Upvotes: 1
