Reputation: 630
Given some HTML, how can I remove all the tags but keep the text and the img and a tags? For example, I have
<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>
I want to keep
Hello all <a href ="xx"></a> <img rscr="xx"></img>
Is there something implemented in BeautifulSoup or Python?
Thanks
Upvotes: 5
Views: 3417
Reputation: 241078
You could select all of the descendant nodes by accessing the .descendants property.
From there, you could iterate over all of the descendants and filter them based on the name property. If the node doesn't have a name property, then it is likely a text node, which you want to keep. If the name property is 'a' or 'img', then you keep it as well.
# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []
for node in container.descendants:
    if not node.name or node.name == 'a' or node.name == 'img':
        keep.append(node)
Here is an alternative where all the filtered elements are used to create the list directly:
# This should be the wrapper that you are targeting
container = soup.find('div')
keep = [node for node in container.descendants
        if not node.name or node.name == 'a' or node.name == 'img']
If you also don't want empty or whitespace-only strings to be returned, you can strip the whitespace and check for that as well:
keep = [node for node in container.descendants
        if (not node.name and len(node.strip())) or
           (node.name == 'a' or node.name == 'img')]
Based on the HTML that you provided, this last version would return:
> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
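If you need the cleaned markup as a single string rather than a list of nodes, one option is to modify the tree in place instead of collecting nodes. The following is only a minimal sketch: it assumes the built-in html.parser, and it assumes that script and style elements should be dropped together with their contents. The KEEP and DROP names are mine, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
soup = BeautifulSoup(html, 'html.parser')

KEEP = {'a', 'img'}          # tags that stay in the output (assumption: adjust as needed)
DROP = ['script', 'style']   # tags removed together with everything inside them

# Remove script/style completely, including their contents.
for tag in soup.find_all(DROP):
    tag.decompose()

# Every other tag that is not whitelisted is replaced by its children,
# so the text and any nested a/img tags survive.
for tag in soup.find_all(True):
    if tag.name not in KEEP:
        tag.unwrap()

cleaned = str(soup)
print(cleaned)
```

unwrap() removes only the tag itself and leaves its children in place, while decompose() destroys the tag and its contents, which is why the two are used for different groups of tags here.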
Upvotes: 3
Reputation: 12168
import bs4
html = '''<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
soup.div.text, soup.div.find_next('a'), soup.div.find_next('img')
out:
('Hello all ', <a href="xx"></a>, <img rscr="xx"/>)
When the element you want is a descendant of the tag, there is a shortcut:
soup.div.text, soup.div.a, soup.div.img
out:
('Hello all ', <a href="xx"></a>, <img rscr="xx"/>)
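The two forms only differ when the target is not inside the div. A small sketch to illustrate, using a made-up snippet (not the HTML from the question) and the built-in html.parser as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> sits after the div, not inside it.
html = '<div><p>inside</p></div><a href="xx">outside</a>'
soup = BeautifulSoup(html, 'html.parser')

# find_next walks forward through the whole document in parse order,
# so it can return a tag that lies outside the div.
via_find_next = soup.div.find_next('a')

# Attribute access and find() only look at the div's descendants.
via_attribute = soup.div.a
via_find = soup.div.find('a')

print(via_find_next, via_attribute, via_find)
```

So if you only want matches inside the wrapper, soup.div.a or soup.div.find('a') is probably the safer choice.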
find_next returns the next matching element in the DOM.
Upvotes: 0