Rajeev

Reputation: 46979

Get contents by class names using Beautiful Soup

Using the Beautiful Soup module, how can I get the contents of a div tag whose class attribute is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div> 

and this is the Python code:

 from BeautifulSoup import BeautifulSoup
 html_doc = open('home.jsp.html', 'r')

 soup = BeautifulSoup(html_doc)
 class="feeditemcontent cxfeeditemcontent"

Upvotes: 17

Views: 64045

Answers (6)

Aziz Alto

Reputation: 20401

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

For example, to get all div tags with the class header (<div class="header">) from stackoverflow.com using BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests 

url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

tags = soup.findAll("div", class_="header")

This is already covered in the bs4 documentation.
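As a rough sketch of how class_ behaves in bs4 (the three-div HTML below is made up for illustration, and "html.parser" is just one parser choice): a single class name matches any tag that carries that class, a space-separated value is compared against the exact attribute string, and a CSS selector via select() matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in different orders, plus one div
# that only has the first class.
html = """
<div class="feeditemcontent cxfeeditemcontent"><span>first</span></div>
<div class="cxfeeditemcontent feeditemcontent"><span>second</span></div>
<div class="feeditemcontent"><span>third</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute: order matters.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# A single class name: matches every div that has this class among others.
single = soup.find_all("div", class_="feeditemcontent")

# CSS selector: requires both classes, in any order.
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

exact_text = [d.get_text(strip=True) for d in exact]
single_text = [d.get_text(strip=True) for d in single]
both_text = [d.get_text(strip=True) for d in both]
print(exact_text, single_text, both_text)
```

If the order of the class names in the real page is not guaranteed, the select() form is probably the safer of the three.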

Upvotes: 11

user1438327

Reputation:

from BeautifulSoup import BeautifulSoup 
f = open('a.htm')
soup = BeautifulSoup(f) 
# Renamed from "list" so the built-in list type is not shadowed
matches = soup.findAll('div', attrs={'id': 'abc def'})
print matches

Upvotes: 6

Leonard Richardson

Reputation: 4164

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:

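from bs4 import BeautifulSoup

def match_class(target):
    # Build a filter that accepts tags carrying every class in target
    def do_match(tag):
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

# html is the HTML string to parse (for example, the one in jadkik94's answer)
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))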
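To make the list behaviour concrete, here is a small self-contained sketch. The helper name has_classes and the shortened HTML are illustrative assumptions, not part of bs4; it is the same idea as match_class, written with a set so the order of the class names does not matter.

```python
from bs4 import BeautifulSoup

# Shortened, made-up version of the markup from the question.
html = (
    '<div class="feeditemcontent cxfeeditemcontent">'
    '<div class="feeditembody"><span>data</span></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# In bs4 the class attribute comes back as a list of class names.
outer_classes = soup.find("div")["class"]

def has_classes(*wanted):
    """Hypothetical helper: build a filter for tags having all wanted classes."""
    wanted = set(wanted)
    def predicate(tag):
        # Tags without a class attribute fall back to an empty list.
        return wanted.issubset(tag.get("class", []))
    return predicate

matches = soup.find_all(has_classes("cxfeeditemcontent", "feeditemcontent"))
print(outer_classes, len(matches))
```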

Upvotes: 25

jadkik94

Reputation: 7078

Try this. It may be overkill for something this simple, but it works:

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10

Upvotes: 12

UltraInstinct

Reputation: 44464

Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

As you can see, Beautiful Soup does not really understand class="a b" as two separate classes a and b.

However, as the first comment there suggests, a simple regular expression should suffice. In your case:

import re

soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x

Note: This has been fixed in the recent beta. I haven't gone through the docs for the recent versions; you may want to check them. If you need this to work with the older version, use the code above.
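One caveat about the word-boundary pattern, sketched with plain re and no Beautiful Soup involved: \b also sees a boundary next to a hyphen, so a class such as feeditemcontent-wide (a made-up name for illustration) would match too. A whitespace-delimited pattern is one possible stricter alternative.

```python
import re

# Pattern from the answer above: word boundaries on both sides.
loose = re.compile(r"\bfeeditemcontent\b")

# Stricter variant (an assumption, not from the bug report): the name must be
# delimited by whitespace or by the start/end of the attribute value.
strict = re.compile(r"(?<!\S)feeditemcontent(?!\S)")

samples = [
    "feeditemcontent cxfeeditemcontent",  # the class string from the question
    "cxfeeditemcontent",                  # name embedded in a longer word
    "feeditemcontent-wide",               # hypothetical hyphenated class
]

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]
print(loose_hits)
print(strict_hits)
```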

Upvotes: 0

Jordan Dimov

Reputation: 1318

soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
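To then pull the text out of the matched div, something like the following sketch may help. It assumes bs4 with the built-in "html.parser" and uses the HTML from the question inline rather than reading home.jsp.html.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match, or None when nothing matches.
container = soup.find("div", {"class": "feeditemcontent cxfeeditemcontent"})

if container is not None:
    # All text inside the div, with surrounding whitespace stripped.
    text = container.get_text(strip=True)
    # Or go straight to the nested span.
    span_text = container.find("span").string
    print(text)
```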

Upvotes: 3
