Reputation: 592
I have written code to extract the URL and title of a book from a page using BeautifulSoup.
But it is not extracting the name of the book, Astounding Stories of Super-Science April 1930, which sits between the opening <a ...> and closing </a> tags.
How can I extract the name of the book?
I have tried the findnext method recommended in another question, but I get an AttributeError on that.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify = False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)
Upvotes: 3
Views: 203
Reputation: 10090
You should use the text attribute of the element. The following works for me:
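import requests
from bs4 import BeautifulSoup

def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    # html is the URL of the page; pass it on instead of relying on a global BASE_URL
    soup = make_soup(html)
    for li in soup.findAll('li'):
        try:
            try:
                # li.a.text is the text between <a ...> and </a>
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            # li.a is None when the <li> contains no link
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')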
I get the following output for the element in question:
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
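The difference between the title attribute and the link text can be checked without any network access. The following is a minimal sketch, assuming Python 3 and the bs4 package (the listings in this thread are Python 2); it parses only the li fragment quoted in the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The <li> fragment quoted in the question, inlined so nothing is downloaded.
HTML = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_link = soup.li.a  # first <a> inside the first <li>

href = first_link["href"]         # attribute lookup
title_attr = first_link["title"]  # the title *attribute*, not the visible text
link_text = first_link.text       # the text between <a ...> and </a>

print(href, link_text)
print(title_attr)
```

Here li.a['title'] yields the ebook identifier stored in the attribute, which is why the original code never printed the book name.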
Upvotes: 3
Reputation: 1936
According to the BeautifulSoup documentation, the .string property should accomplish what you are trying to do. Edit your original listing this way:
# ...
try:
    print li.a['href'], li.a['title']
    print "\n"
    print li.a.string
except KeyError:
    pass
# ...
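You probably want to surround it with something like
if 'extiw' in li.a.get('class', []):
    print li.a.string
since, in your example, only the anchors of class extiw contain a book title. (With BeautifulSoup 4, li.a['class'] is a list such as ['extiw'], so comparing it to the string "extiw" with == is always false.)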
Thanks @wilbur for pointing out the optimal solution.
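To illustrate the class check together with .string, here is a small sketch under the same assumptions as the question's HTML (Python 3, bs4, the fragment parsed from a string rather than fetched). The titles list is a hypothetical name introduced for the example.

```python
from bs4 import BeautifulSoup

HTML = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")

# bs4 treats class as a multi-valued attribute, so it comes back as a list.
first_link_classes = soup.li.a["class"]

titles = []
for a in soup.find_all("a"):
    # Membership test instead of ==, and .get() so links without a class do not raise.
    if "extiw" in a.get("class", []):
        titles.append(a.string)

# The icon link wraps only an <img>, so it has no string of its own.
icon_string = soup.find("a", class_="image").string

print(titles)
```

The icon link shows why the filter matters: .string is None for an anchor whose only child is an image.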
Upvotes: 3
Reputation: 9038
To get just the text that is not inside any tags, use the get_text() method. It is described in the BeautifulSoup documentation.
I can't test it because I don't know the URL of the page you are trying to scrape, but you can probably just call it on the li tag since there doesn't seem to be any other text.
Try replacing this:
for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass
with this:
for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
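As a quick check of what get_text() returns for the fragment in the question, the sketch below (Python 3 and bs4 assumed, HTML inlined from the question) calls it on the whole li and on the first link only. Note that the li also contains the trailing "(English)" text, so calling get_text() on the li returns more than the book name.

```python
from bs4 import BeautifulSoup

HTML = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")
li = soup.li

raw_text = li.get_text()                     # keeps the newlines between the tags
compact_text = li.get_text(" ", strip=True)  # strips each piece and joins with a space
link_only = li.a.get_text()                  # text of the first link alone

print(repr(raw_text))
print(compact_text)
print(link_only)
```

If only the book name is wanted, calling get_text() on the link rather than on the li avoids having to trim the language suffix afterwards.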
Upvotes: 1
Reputation: 620
I did not see how you can extract the text within the tag. I would do something like this:
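from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

# html is the URL of the page to scrape
soup = bs(uo(html), 'html.parser')
for li in soup.find_all('li'):
    a = li.find('a')
    # skip list items that have no link, or a link with no children
    if a is not None and a.contents:
        book_title = a.contents[0]
        print book_title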
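One caveat with contents[0] is that it returns whatever the first child happens to be, which is not always text. A minimal sketch on the question's fragment, assuming Python 3 and bs4 (the names book_link and icon_link are made up for the example):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

HTML = """
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")
book_link, icon_link = soup.find_all("a")

book_child = book_link.contents[0]  # a NavigableString holding the book name
icon_child = icon_link.contents[0]  # a Tag: the <img> element, not text

print(type(book_child).__name__, str(book_child))
print(type(icon_child).__name__, icon_child.name)
```

For the first link in each li this gives the book name, but a type check (or .string / get_text()) is safer when the first child may be an element.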
Upvotes: 1