Reputation: 2716
I have a complex HTML document that has nested <div> tags, such as the following:
<html>
  <body>
    <div id="one">
      <p>1. Get this div!</p>
    </div>
    <div id="two">
      <div>
        <div id="three">
          <p>2. Get this div!</p>
        </div>
      </div>
      <div id="four">
        <p>3. Get this div!</p>
      </div>
    </div>
  </body>
</html>
And I am attempting to use the following code:
from bs4 import BeautifulSoup
# html holds the document shown above as a string
soup = BeautifulSoup(html, 'html.parser')
div_list = soup.find_all('div')
However, the code above appears to return only the topmost-level divs, that is, the divs with ids "one" and "two". I would like BeautifulSoup to return a list of the divs with ids "one", "three", and "four" instead. How can I accomplish this?
Upvotes: 2
Views: 1105
Reputation: 71451
The simplest way is to create a list of the desired ids and then use re.compile:
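from bs4 import BeautifulSoup as soup
import re
# content holds the HTML document from the question as a string
ids = ['one', 'three', 'four']
# keep only the divs whose id matches one of the listed ids
results = soup(content, 'html.parser').find_all('div', {'id': re.compile('|'.join(ids))})
for i in results:
    print(i)
    print('-'*20)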
Output:
<div id="one">
<p>1. Get this div!</p>
</div>
--------------------
<div id="three">
<p>2. Get this div!</p>
</div>
--------------------
<div id="four">
<p>3. Get this div!</p>
</div>
--------------------
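One caveat about the pattern above: BeautifulSoup applies a compiled regex as a search, so the joined pattern would also accept an id such as "someone". The following is a minimal sketch, assuming exact id matches are wanted, that passes the list itself as the filter. The flat sample markup and the extra "someone" div are hypothetical and serve only to show the difference.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the extra "someone" div is not part of the question
html = '''
<div id="one"><p>1</p></div>
<div id="someone"><p>x</p></div>
<div id="three"><p>2</p></div>
<div id="four"><p>3</p></div>
'''
ids = ['one', 'three', 'four']
soup = BeautifulSoup(html, 'html.parser')

# Regex filter: applied as a search, so "someone" is picked up as well
by_regex = [d['id'] for d in soup.find_all('div', id=re.compile('|'.join(ids)))]

# List filter: the id must equal one of the listed strings
by_list = [d['id'] for d in soup.find_all('div', id=ids)]

print(by_regex)
print(by_list)
```

If a regex is still preferred, anchoring it (for example with `^` and `$` around the alternation) should have the same effect as the list filter.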
Alternatively, without a predefined list of ids, recursion can be used:
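def get_ids(_d):
    # a div with no div among its direct children is a result
    if not any(getattr(i, '__dict__', {}).get('name') == 'div' for i in _d.__dict__['contents']):
        return _d
    # otherwise descend into the child divs and keep the first result
    _r = [get_ids(i) for i in _d.__dict__['contents'] if getattr(i, '__dict__', {}).get('name') == 'div']
    return None if not _r else _r[0]

final_results = []
for i in [get_ids(i) for i in soup(content, 'html.parser').find_all('div')]:
    # skip duplicates, since nested divs resolve to the same innermost div
    if i not in final_results:
        final_results.append(i)
print(final_results)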
Output:
[<div id="one"><p>1. Get this div!</p></div>, <div id="three"><p>2. Get this div!</p></div>, <div id="four"><p>3. Get this div!</p></div>]
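The recursive idea above can also be written against the public API rather than `__dict__`. This is only a sketch under assumptions: the helper name `innermost_divs` is hypothetical, and it relies on `find_all(True, recursive=False)` returning the direct child tags of an element.

```python
from bs4 import BeautifulSoup

html = '''
<html><body>
<div id="one"><p>1. Get this div!</p></div>
<div id="two">
  <div><div id="three"><p>2. Get this div!</p></div></div>
  <div id="four"><p>3. Get this div!</p></div>
</div>
</body></html>
'''

def innermost_divs(tag):
    """Hypothetical helper: collect the divs under `tag` that contain no other div."""
    found = []
    # True matches every tag; recursive=False limits the search to direct children
    for child in tag.find_all(True, recursive=False):
        nested = innermost_divs(child)
        if nested:
            # something deeper qualified, so this child is not innermost
            found.extend(nested)
        elif child.name == 'div':
            found.append(child)
    return found

soup = BeautifulSoup(html, 'html.parser')
result_ids = [d.get('id') for d in innermost_divs(soup)]
print(result_ids)
```

Each div is visited once, so no duplicate check is needed afterwards.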
Upvotes: 1
Reputation: 57033
You can directly check whether any of the found divisions has more divisions inside:
[d for d in soup.find_all('div') if not d.find('div')]
#[<div id="one"><p>1. Get this div!</p></div>,
# <div id="three"><p>2. Get this div!</p></div>,
# <div id="four"><p>3. Get this div!</p></div>]
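For reference, here is a self-contained sketch of the filter above, run on the question's sample document. It also shows why a filter is needed: `find_all` searches recursively by default, so on this input it returns all five divs, including the one without an id. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<html><body>
<div id="one"><p>1. Get this div!</p></div>
<div id="two">
  <div><div id="three"><p>2. Get this div!</p></div></div>
  <div id="four"><p>3. Get this div!</p></div>
</div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all descends into nested tags by default (recursive=True)
all_ids = [d.get('id') for d in soup.find_all('div')]

# keep only the divs that have no div anywhere inside them
leaf_ids = [d.get('id') for d in soup.find_all('div') if not d.find('div')]

print(all_ids)   # the div without an id shows up as None
print(leaf_ids)
```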
Upvotes: 1