Reputation: 43
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup

lis = []
webpage = urllib.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = BeautifulSoup(webpage)
for ul in soup:
    for li in soup.findAll('li'):
        lis.append(li)
for li in lis:
    print li.text.encode("utf-8")
I'm just trying to get the cat facts from between the opening and closing "li" tags and output them in a way that doesn't look messed up. Currently, the output from this code repeats all of the facts 4 times or so and the word "can't" comes out as "can’t".
I'd appreciate any help.
Upvotes: 0
Views: 238
Reputation: 129079
The page's Content-Type says its encoding is ISO-8859-1, but it is lying. Tell Beautiful Soup to ignore its lies using from_encoding. You can make Beautiful Soup do less work by giving it a SoupStrainer for parse_only that selects only things with the content-td class. Finally, you can simplify your for loops. All together:
import urllib2
import bs4

webpage = urllib2.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = bs4.BeautifulSoup(webpage, from_encoding='UTF-8',
                         parse_only=bs4.SoupStrainer(attrs='content-td'))
for li in soup('li'):
    print li.text.encode('utf-8')
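To see why from_encoding matters, here is a small sketch that reproduces the garbled apostrophe from the question. It assumes the page's bytes are really UTF-8 and were decoded as Windows-1252, which many tools substitute for ISO-8859-1. The variable names are only for illustration.

```python
# Assumption: the page sends a curly apostrophe (U+2019) encoded as UTF-8.
raw = u'can\u2019t'.encode('utf-8')      # the three bytes E2 80 99 sit between "can" and "t"

# Decoding those bytes with the declared (wrong) encoding turns each byte
# into its own character, which is the "can’t" seen in the question.
garbled = raw.decode('cp1252')           # u'can\xe2\u20ac\u2122t'

# Decoding with the real encoding recovers the original text.
correct = raw.decode('utf-8')            # u'can\u2019t'
```

Passing from_encoding='UTF-8' makes Beautiful Soup take the second path instead of trusting the declared encoding.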
You can further improve the output by replacing consecutive whitespace with a single space and removing the superscripts.
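A minimal sketch of that cleanup, assuming the superscripts are sup elements inside each li (not checked against the page). The helper names collapse_whitespace and clean_fact are made up for this example, and li is expected to be a Beautiful Soup tag.

```python
import re


def collapse_whitespace(text):
    # Replace every run of whitespace (spaces, tabs, newlines) with a single
    # space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


def clean_fact(li):
    # Assumption: `li` is a bs4 Tag and the footnote markers are <sup> tags.
    # extract() detaches each <sup> from the tree, so its text no longer
    # shows up in get_text().
    for sup in li.find_all('sup'):
        sup.extract()
    return collapse_whitespace(li.get_text())
```

In the loop above you would then print clean_fact(li).encode('utf-8') instead of li.text.encode('utf-8').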
Upvotes: 1
Reputation: 64328
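You don't need the outer loop (for ul in soup). Iterating over soup yields its top-level nodes, and the inner loop runs a full findAll('li') for each of them, which is why every fact is repeated. Remove the outer loop and each fact is output once: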
soup = BeautifulSoup(webpage)
for li in soup.findAll('li'):
    lis.append(li)
for li in lis:
    print li.text.encode("utf-8")
Upvotes: 1