Sam

Reputation: 592

Extract content of <a> tag

I have written code that uses BeautifulSoup to extract the URL and title of each book from a page.

However, it does not extract the name of the book, Astounding Stories of Super-Science April 1930, which is the text between the opening <a ...> tag and the closing </a> tag.

How can I extract the name of the book?

I have tried the findnext method recommended in another question, but it raises an AttributeError.

HTML:

    <li>
        <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
        <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
        (English)
    </li>

Code below:

    def make_soup(BASE_URL):
        r = requests.get(BASE_URL, verify = False)
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

    def extract_text_urls(html):
        soup = make_soup(BASE_URL)

        for li in soup.findAll('li'):
            try:
                try:
                    print li.a['href'], li.a['title']
                    print "\n"
                except KeyError:
                    pass
            except TypeError:
                pass

    extract_text_urls(filename)

Upvotes: 3

Views: 203

Answers (4)

wpercy

Reputation: 10090

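You should use the text attribute of the element. li.a['title'] returns the value of the title attribute (ebook:29390 in your sample), not the text between the tags. The following works for me:

    import requests
    from bs4 import BeautifulSoup

    def make_soup(url):
        r = requests.get(url)
        return BeautifulSoup(r.text, 'html.parser')

    def extract_text_urls(url):
        # Use the URL that was passed in rather than an undefined global.
        soup = make_soup(url)

        for li in soup.find_all('li'):
            try:
                # li.a is None when the <li> has no link (TypeError);
                # an <a> without an href raises KeyError.
                print(li.a['href'], li.a.text)
                print("\n")
            except (KeyError, TypeError):
                pass

    extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')

I get the following output for the element in question: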

//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
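For anyone who wants to try this without hitting the network, here is a minimal sketch that runs the same idea against the markup from the question. The function name extract_links and the second, link-less <li> are my own additions for illustration, not part of the original page. It checks for a missing link explicitly instead of catching TypeError.

```python
from bs4 import BeautifulSoup

# First <li> is copied from the question; the second one is a made-up
# item without a link, added to show the "no <a>" case.
SAMPLE_HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
    (English)
</li>
<li>No link in this item</li>
"""


def extract_links(html):
    """Return (href, link text) pairs for the first <a> of each <li>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for li in soup.find_all("li"):
        a = li.a  # None when the <li> contains no <a>
        if a is None or not a.has_attr("href"):
            continue
        pairs.append((a["href"], a.text))
    return pairs


links = extract_links(SAMPLE_HTML)
print(links)
```

The explicit None check does the same job as the try/except TypeError in the listing above, but makes the reason for skipping an item visible.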

Upvotes: 3

Tobia Tesan

Reputation: 1936

According to the BeautifulSoup documentation, the .string property should do what you want. Edit your original listing like this:

    # ... 
        try:
            print li.a['href'], li.a['title']
            print "\n"
            print li.a.string
        except KeyError:
            pass
    # ... 

You probably want to guard it with something like

    # 'class' is a multi-valued attribute, so bs4 returns it as a list
    if "extiw" in li.a.get('class', []):
        print li.a.string

since, in your example, only the anchors of class extiw contain a book title.

Thanks @wilbur for pointing out the optimal solution.
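A small sketch of the class filter and .string, run against the markup from the question. This is only an illustration under the assumption that every book link carries the extiw class, as in the sample. It lets find_all do the class matching instead of comparing the attribute by hand.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
SAMPLE_HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
    (English)
</li>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "class" is multi-valued, so indexing the tag gives a list, not a string.
first = soup.li.a
print(first["class"])

# Let find_all match on the class and collect the text of each match.
titles = [a.string for a in soup.find_all("a", class_="extiw")]
print(titles)

# The icon link wraps only an <img>, so it has no string of its own.
icon = soup.find("a", class_="image")
print(icon.string)
```

Note that .string is None whenever a tag has more than one child or no text at all, so it is worth checking the result before printing it.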

Upvotes: 3

dstudeba

Reputation: 9038

To get only the text of an element, without any markup, use the get_text() method. It is described in the BeautifulSoup documentation.

I can't test it because I don't know the URL of the page you are trying to scrape, but you can probably just call it on the li tag, since there doesn't seem to be much other text (in your sample, only the trailing (English)).

Try replacing this:

    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

with this:

    for li in soup.findAll('li'):
        try:
            print(li.get_text())
            print("\n")
        except TypeError:
            pass
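To see what get_text() returns at each level, here is a rough sketch using the markup from the question. The separator and strip arguments are optional parameters of get_text(); I use them here only to tidy up the whitespace around each piece of text.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
SAMPLE_HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
    (English)
</li>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
li = soup.li

# On the <li>: all text in the item, including the trailing language note.
li_text = li.get_text(" ", strip=True)
print(li_text)

# On the first <a>: only the link text, i.e. the book name.
a_text = li.a.get_text(strip=True)
print(a_text)
```

So calling it on the li is fine if you do not mind the extra (English); call it on li.a if you want the book name alone.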

Upvotes: 1

Zhiya

Reputation: 620

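I don't see where your code extracts the text inside the tag. I would do something like this:

    from bs4 import BeautifulSoup as bs
    from urllib.request import urlopen as uo

    # html is the URL of the page to scrape
    soup = bs(uo(html), 'html.parser')

    for li in soup.find_all('li'):
        a = li.find('a')
        if a is None or not a.contents:
            continue  # skip items without a link or with an empty link
        book_title = a.contents[0]
        print(book_title)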
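One caveat with contents[0]: it is only text when the first child of the link is a text node. A quick sketch on the markup from the question shows the difference; first_text is a hypothetical helper name of my own.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Markup copied from the question.
SAMPLE_HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
    (English)
</li>
"""


def first_text(a):
    """Return the link's first child as a str if it is text, else None."""
    if a is None or not a.contents:
        return None
    first = a.contents[0]
    return str(first) if isinstance(first, NavigableString) else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [first_text(a) for a in soup.find_all("a")]
print(results)
```

For the book link the first child is the title; for the icon link it is the <img> tag, so the helper returns None instead of printing markup.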

Upvotes: 1
