Snaxib
Snaxib

Reputation: 1690

How to find tags with only certain attributes - BeautifulSoup

How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?

For example, I want to find all <td valign="top"> tags.

The following code: raw_card_data = soup.fetch('td', {'valign':re.compile('top')})

gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top

I also tried: raw_card_data = soup.findAll(re.compile('<td valign="top">')) and this returns nothing (probably because of bad regex)

I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"

UPDATE FOr example, if an HTML document contained the following <td> tags:

<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />

I would want only the first <td> tag (<td width="580" valign="top">) to return

Upvotes: 133

Views: 225324

Answers (9)

If you want to print the name of all the tags in different lines that have specific attribute , for example printing all the tags with id attribute irrespective of the value :

from bs4 import BeautifulSoup ;
from bs4 import element ;
html = '!DOCTYPE html><html><head><title>Navigate Parse Tree</title></head>\
<body><h1>This is your Assignment</h1><a href = "https://www.google.com">This is a link that will take you to Google</a>\
<ul><li><p> This question is given to test your knowledge of <b>Web Scraping</b></p>\
<p>Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web.</p></li>\
<li id = "li2">This is an li tag given to you for scraping</li>\
<li>This li tag gives you the various ways to get data from a website\
<ol><li class = "list_or">Using API of the website</li><li>Scrape data using BeautifulSoup</li><li>Scrape data using Selenium</li>\
<li>Scrape data using Scrapy</li></ol></li>\
<li class = "list_or"><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">\
Clicking on this takes you to the documentation of BeautifulSoup</a>\
<a href="https://selenium-python.readthedocs.io/" id="anchor">Clicking on this takes you to the documentation of Selenium</a>\
</li></ul></body></html>'

data = BeautifulSoup(html, 'html.parser');
for i in data.descendants :
     if type(i) == element.Tag:
        if i.attrs != {} and 'id' in i.attrs:
           print(i.name)

Upvotes: 1

Michael
Michael

Reputation: 61

If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the tag, just put True.

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})

This will return all td tags that have valign attributes. This is useful if your project involves pulling info from a tag like div that is used all over, but can handle very specific attributes that you might be looking for.

Upvotes: 6

Shah Vipul
Shah Vipul

Reputation: 745

find using an attribute in any tag

<th class="team" data-sort="team">Team</th>    
soup.find_all(attrs={"class": "team"}) 

<th data-sort="team">Team</th>  
soup.find_all(attrs={"data-sort": "team"}) 
 

Upvotes: 6

Lo&#239;c G.
Lo&#239;c G.

Reputation: 3157

As explained on the BeautifulSoup documentation

You may use this :

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

EDIT :

To return tags that have only the valign="top" attribute, you can check for the length of the tag attrs property :

from BeautifulSoup import BeautifulSoup

html = '<td valign="top">.....</td>\
        <td width="580" valign="top">.......</td>\
        <td>.....</td>'

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

for result in results :
    if len(result.attrs) == 1 :
        print result

That returns :

<td valign="top">.....</td>

Upvotes: 149

Amr
Amr

Reputation: 2235

if you want to only search with attribute name with any value

from bs4 import BeautifulSoup
import re

soup= BeautifulSoup(html.text,'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})

as per Steve Lorimer better to pass True instead of regex

results = soup.findAll("td", {"valign" : True})

Upvotes: 61

DoctorZebra
DoctorZebra

Reputation: 631

Adding a combination of Chris Redford's and Amr's answer, you can also search for an attribute name with any value with the select command:

from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
    <td width="580" valign="top">.......</td>\
    <td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')

Upvotes: 4

Chris Redford
Chris Redford

Reputation: 17788

The easiest way to do this is with the new CSS style select method:

soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')

Upvotes: 21

Yogesh
Yogesh

Reputation: 895

You can use lambda functions in findAll as explained in documentation. So that in your case to search for td tag with only valign = "top" use following:

td_tag_list = soup.findAll(
                lambda tag:tag.name == "td" and
                len(tag.attrs) == 1 and
                tag["valign"] == "top")

Upvotes: 68

juliomalegria
juliomalegria

Reputation: 24941

Just pass it as an argument of findAll:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body><html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]

Upvotes: 5

Related Questions