Reputation: 46979
Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.class['feeditemcontent cxfeeditemcontent']
or:
soup.find_all('class')
This is the HTML source:
<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>
and this is the Python code:
from BeautifulSoup import BeautifulSoup
html_doc = open('home.jsp.html', 'r')
soup = BeautifulSoup(html_doc)
class="feeditemcontent cxfeeditemcontent"
Upvotes: 17
Views: 64045
Reputation: 20401
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
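As a minimal sketch of how this behaves with two class names (the three sample div elements below are made up for illustration and are not from the question): passing a space-separated string to class_ matches the exact attribute string, so the order of the classes matters, whereas a CSS selector via select() requires both classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div class="feeditemcontent cxfeeditemcontent">first</div>
<div class="cxfeeditemcontent feeditemcontent">second</div>
<div class="feeditemcontent">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match against the whole class attribute: order matters.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# A single class name matches any div that carries that class.
single = soup.find_all("div", class_="feeditemcontent")

# CSS selector: both classes are required, in any order.
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

exact_text = [d.get_text() for d in exact]    # ['first']
single_text = [d.get_text() for d in single]  # ['first', 'second', 'third']
both_text = [d.get_text() for d in both]      # ['first', 'second']
```

If the order of the class names in the page is not guaranteed, the select() form is probably the safer choice.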
Upvotes: 11
Reputation:
from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
# "divs" rather than "list", so the built-in list type is not shadowed
divs = soup.findAll('div', attrs={'id': 'abc def'})
print divs
Upvotes: 6
Reputation: 4164
Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:
from bs4 import BeautifulSoup
def match_class(target):
    def do_match(tag):
        # bs4 returns the class attribute as a list of names
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match
soup = BeautifulSoup(html)
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))
Upvotes: 25
Reputation: 7078
Try this. It may be overkill for something this simple, but it works:
def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10
matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
Upvotes: 12
Reputation: 44464
Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304
As you can see, Beautiful Soup cannot really interpret class="a b" as two classes, a and b.
However, as the first comment there suggests, a simple regexp should suffice. In your case:
import re
soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x
Note: This has been fixed in the recent beta. I haven't gone through the docs for the recent versions; maybe you could do that. Or, if you want to get it working with the older version, you could use the above.
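To see what the word-boundary pattern does to a raw class string, here is a small sketch that uses only the re module and no BeautifulSoup at all. The helper names and the third sample value (feeditemcontent-wide) are hypothetical. It also shows one caveat: \b treats a hyphen as a boundary, so a hyphenated class name that merely starts with the target still matches, while splitting on whitespace does not have that problem.

```python
import re

pattern = re.compile(r"\bfeeditemcontent\b")

def has_class_regex(class_attr):
    # True if the pattern is found anywhere in the attribute string.
    return pattern.search(class_attr) is not None

def has_class_split(class_attr, name="feeditemcontent"):
    # True only if one whitespace-separated token equals the name exactly.
    return name in class_attr.split()

samples = [
    "feeditemcontent cxfeeditemcontent",  # the value from the question
    "cxfeeditemcontent",                  # target appears only as a suffix
    "feeditemcontent-wide",               # hypothetical hyphenated class
]

regex_hits = [has_class_regex(s) for s in samples]  # [True, False, True]
split_hits = [has_class_split(s) for s in samples]  # [True, False, False]
```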
Upvotes: 0
Reputation: 1318
soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
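Since the question asks for the data inside the div, here is a minimal sketch of pulling the text out once the tag is found. It assumes bs4 rather than the older BeautifulSoup 3 import, and it reuses the HTML from the question; find() returns None when nothing matches, so the result is checked before use.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if there is no match.
div = soup.find("div", {"class": "feeditemcontent cxfeeditemcontent"})

# get_text(strip=True) drops the whitespace-only strings between the tags.
text = div.get_text(strip=True) if div is not None else None
print(text)  # The actual data is some where here
```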
Upvotes: 3