guidetuanhp

Reputation: 43

How to get href from HTML class?

I want to get the href values from the result of a = soup.find_all('div', class_='email-messages'), which is:

[<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, '[email protected]')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>]

My code:

soup = BeautifulSoup(html_doc, 'lxml')
a = soup.find_all('div', class_='email-messages')
for link in a:
    print(link['href'])

I get this error:

in __getitem__
    return self.attrs[key]
KeyError: 'href'

Upvotes: 1

Views: 192

Answers (2)

cards

Reputation: 4965

For "single-purpose" scraping it can be useful to customize the parser with a SoupStrainer. It should be faster, since only the desired portion of the document is parsed. See the "Parsing only part of a document" section of the Beautiful Soup documentation for details.

The SoupStrainer instance must always be passed to the BeautifulSoup constructor as the parse_only keyword argument:

from bs4 import BeautifulSoup, SoupStrainer

# html_doc is the HTML string shown in the question

soup = BeautifulSoup(html_doc, 'lxml', parse_only=SoupStrainer('a', href=True))
for tag in soup:
    print(tag['href'])

Output

/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete

Remember

  1. The soup is "strained": you are iterating over a BeautifulSoup object, not a list, so the loop variable is a bs4.element.Tag object.
  2. SoupStrainer takes the same filtering arguments as the find_all method (name, attrs, string, and keyword filters).
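
One consequence of straining worth seeing in code: anything that does not match the strainer is never parsed, so it cannot be searched for afterwards. Below is a minimal sketch of that, assuming a shortened, hypothetical stand-in for the question's HTML (the /en/msg/ID paths are placeholders) and the built-in html.parser instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened, hypothetical stand-in for the HTML in the question.
html_doc = """<div class="email-messages">
<table>
<tr><td><a href="/en/msg/ID">Subject</a></td>
<td><a href="/en/msg/ID/delete">[Delete]</a></td></tr>
</table>
</div>"""

# Keep only <a> tags that carry an href attribute.
only_links = SoupStrainer("a", href=True)
strained = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

# The surrounding <div> was never parsed, so it cannot be found afterwards.
div = strained.find("div", class_="email-messages")

# Only the matching <a> tags (and their contents) are in the soup.
hrefs = [tag["href"] for tag in strained.find_all("a")]

print(div)
print(hrefs)
```

So if you need the links from one specific div only, and the page has other links elsewhere, straining on 'a' alone will pick up those as well.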

Upvotes: 2

Andrej Kesely

Reputation: 195408

You're trying to get "href" from the <div> tag. Try to find all <a> tags inside the <div>s:

from bs4 import BeautifulSoup

html_doc = """<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, '[email protected]')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")


divs = soup.find_all("div", class_="email-messages")
for div in divs:
    for link in div.find_all("a"):
        print(link["href"])

Prints:

/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
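
As a possible variation on the nested loops above, the same lookup can be written as one CSS selector with select(), and tag.get("href") can be used instead of tag["href"] when some anchors may have no href (get returns None rather than raising KeyError). This is only a sketch: the HTML below is a hypothetical, shortened stand-in for the question's page, with one extra anchor without href and one link outside the div added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: one anchor has no href, one link sits outside the div.
html_doc = """<div class="email-messages">
<a href="/en/msg/ID">Subject</a>
<a name="no-href">not a link</a>
</div>
<div class="other"><a href="/elsewhere">Elsewhere</a></div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Anchors inside div.email-messages that actually have an href attribute.
links = [a["href"] for a in soup.select("div.email-messages a[href]")]

# .get() returns None for a missing attribute instead of raising KeyError.
all_hrefs = [a.get("href") for a in soup.select("div.email-messages a")]

print(links)
print(all_hrefs)
```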

Upvotes: 1
