user2256199

Reputation: 43

Using Beautiful Soup, grabbing stuff between <li> and </li>

Here is the code I have so far:

import urllib
from bs4 import BeautifulSoup

lis = []
webpage = urllib.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = BeautifulSoup(webpage)
for ul in soup:
    for li in soup.findAll('li'):
        lis.append(li)
    for li in lis:
        print li.text.encode("utf-8")

I'm just trying to get the cat facts from between the opening and closing "li" tags and output them in a way that doesn't look messed up. Currently, the output from this code repeats all of the facts 4 times or so and the word "can't" comes out as "can’t".

I'd appreciate any help.

Upvotes: 0

Views: 238

Answers (2)

icktoofay

Reputation: 129079

The page's Content-Type header says its encoding is ISO-8859-1, but it is lying: the content is actually UTF-8. Tell Beautiful Soup to ignore the lie by passing from_encoding. You can also make Beautiful Soup do less work by giving it a SoupStrainer for parse_only that selects only elements with the content-td class. Finally, you can simplify your for loops. All together:

import urllib2
import bs4

webpage = urllib2.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = bs4.BeautifulSoup(webpage, from_encoding='UTF-8',
                         parse_only=bs4.SoupStrainer(attrs='content-td'))
for li in soup('li'):
    print li.text.encode('utf-8')
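
To see why the declared encoding matters, here is a small sketch of how the garbling arises. It assumes the page's bytes really are UTF-8 and uses cp1252 (the Windows superset of ISO-8859-1) to stand in for the wrongly declared encoding; that choice is an assumption for illustration, not something taken from the page itself.

```python
# A curly apostrophe (U+2019) is three bytes in UTF-8.
original = u'can\u2019t'
raw_bytes = original.encode('utf-8')

# Decoding those bytes with the wrong (declared) encoding turns the
# three bytes into three unrelated characters.
misread = raw_bytes.decode('cp1252')

# Decoding with the real encoding recovers the original text.
correct = raw_bytes.decode('utf-8')
```

Passing from_encoding='UTF-8' makes Beautiful Soup take the second path instead of the first.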

You can further improve the output by replacing consecutive whitespace with a single space and removing the superscripts.
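
One possible way to do that cleanup is sketched below. The helper names collapse_whitespace and clean_li are made up for this example, and it assumes the footnote markers on the page are <sup> elements inside each <li>, which I have not verified against the page.

```python
import re

def collapse_whitespace(text):
    # Replace every run of spaces, tabs, and newlines with one space
    # and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

def clean_li(li):
    # 'li' is expected to be a Beautiful Soup Tag.
    # Note: this modifies the tag by removing its <sup> children.
    for sup in li.find_all('sup'):
        sup.decompose()
    return collapse_whitespace(li.get_text())
```

In the loop above you would then print clean_li(li).encode('utf-8') instead of li.text.encode('utf-8').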

Upvotes: 1

shx2

Reputation: 64328

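You don't need the outer loop (for ul in soup). Every pass of that loop runs findAll again and appends the same items to lis, which is why the facts repeat. Remove it and each fact is printed once.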

soup = BeautifulSoup(webpage)
for li in soup.findAll('li'):
    lis.append(li)
for li in lis:
    print li.text.encode("utf-8")

Upvotes: 1
