Using Beautiful Soup, grabbing stuff between <li> and </li>

Question

Here is the code I have so far:

import urllib
from bs4 import BeautifulSoup

lis = []
webpage = urllib.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = BeautifulSoup(webpage)
for ul in soup:
    for li in soup.findAll('li'):
        lis.append(li)
    for li in lis:
        print li.text.encode("utf-8")

I'm just trying to get the cat facts from between the opening and closing "li" tags and output them in a way that doesn't look messed up. Currently, the output from this code repeats all of the facts 4 times or so and the word "can't" comes out as "canâ€™t".

I'd appreciate any help.

icktoofay · Accepted Answer

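The page's Content-Type header says its encoding is ISO-8859-1, but it is lying: the content is actually UTF-8, which is why "can't" comes out garbled. Tell Beautiful Soup to ignore the lie by passing from_encoding. You can make Beautiful Soup do less work by giving it a SoupStrainer for parse_only that selects only elements with the content-td class. Finally, you can simplify your for loops: the outer "for ul in soup" loop runs once per top-level node of the document, and each pass appends every li again and reprints the whole list, which is where the repetition comes from. All together: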

import urllib2
import bs4

webpage = urllib2.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = bs4.BeautifulSoup(webpage, from_encoding='UTF-8',
                         parse_only=bs4.SoupStrainer(attrs='content-td'))
for li in soup('li'):
    print li.text.encode('utf-8')
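
The garbled "canâ€™t" in the question is what UTF-8 bytes look like when they are decoded with the wrong single-byte codec. Here is a minimal sketch of that effect, assuming the decoder treats the ISO-8859-1 label as Windows-1252 (cp1252), which is common for HTML but is an assumption here. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# The right single quotation mark (U+2019) takes three bytes in UTF-8.
original = u"can\u2019t"
raw = original.encode("utf-8")      # the bytes 'can\xe2\x80\x99t'

# Decoding those bytes as cp1252 turns each byte into its own character,
# so the one quotation mark becomes the three characters seen in "canâ€™t".
garbled = raw.decode("cp1252")

# Decoding with the encoding the bytes were actually written in recovers the text.
fixed = raw.decode("utf-8")
```

This is why overriding the declared encoding with from_encoding='UTF-8' is enough to fix the apostrophes.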

You can further improve the output by replacing consecutive whitespace with a single space and removing the superscripts.
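
One way to do that cleanup is sketched below. It assumes the superscripts are marked up as sup elements, which I have not checked against the page, and the helper names collapse_whitespace and clean_fact are made up for this example.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_fact(li):
    """Return the text of a parsed <li> tag without superscripts.

    `li` is expected to be a bs4 Tag, such as one yielded by soup('li').
    Assumption: superscripts appear as <sup> elements inside the tag.
    """
    for sup in li.find_all("sup"):
        sup.decompose()  # remove the <sup> tag and its contents from the tree
    return collapse_whitespace(li.get_text())
```

In the loop above, clean_fact(li).encode('utf-8') would then take the place of li.text.encode('utf-8'). Note that clean_fact modifies the parsed tree, which is fine here because each li is only printed once.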
