Reputation: 1697
I am trying to get a list of all HTML tags from Beautiful Soup.
I see find_all, but I have to know the name of the tag before I search.
If there is text like
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""
How would I get a list like
list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]
I know how to do this with regex, but am trying to learn BS4
Upvotes: 44
Views: 96820
Reputation: 2646
If you only want specific HTML tags, pass a list of tag names to find_all():
from bs4 import BeautifulSoup

html = driver.page_source  # here the HTML comes from a Selenium WebDriver; any HTML string works
# driver.page_source: "<div>something</div>\n<div>something else</div>\n<div class='magical'>hi there</div>\n<p>ok</p>\n"
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(['a', 'div']):  # list the HTML tag names to match here
    print(tag.text)
# Result:
# something
# something else
# hi there
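If you also want to see which tag each piece of text came from, every element returned by find_all() has a .name attribute. A minimal sketch, reusing the soup object from above:
for tag in soup.find_all(['a', 'div', 'p']):
    print(tag.name, tag.text)  # the tag's name followed by its text content
# div something
# div something else
# div hi there
# p ok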
Upvotes: 3
Reputation: 3091
I thought I'd share my solution to a very similar question, for those who find themselves here later.
I needed to find all tags quickly, but only wanted unique values. I'll use the Python calendar
module to demonstrate: we'll generate an HTML calendar, then parse it and find all (and only) the unique tags present.
The structure below is very similar to the above, using a set comprehension:
from bs4 import BeautifulSoup
import calendar
html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
{tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all()}
# Result
# {'table', 'td', 'th', 'tr'}
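If you also want to know how often each tag occurs (a variation, not part of the answer above), collections.Counter accepts the same generator that set() does:
from collections import Counter
from bs4 import BeautifulSoup
import calendar

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
Counter(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
# a Counter mapping each tag name ('table', 'tr', 'th', 'td') to its number of occurrences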
Upvotes: 4
Reputation: 473763
You don't have to specify any arguments to find_all(); called with no arguments, BeautifulSoup finds every tag in the tree, recursively.
Sample:
from bs4 import BeautifulSoup
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")
print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']
print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
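To get exactly the list from the question (opening tags only, rather than whole elements), one option is a small helper, sketched here as a hypothetical addition, that rebuilds each opening tag from tag.name and tag.attrs. Note that BeautifulSoup stores multi-valued attributes such as class as lists and normalizes quotes to double quotes:
def opening_tag(tag):
    # Rebuild only the opening tag from its name and attribute dict.
    # Multi-valued attributes (e.g. class) come back as lists, so re-join them.
    attrs = "".join(
        ' {}="{}"'.format(name, " ".join(value) if isinstance(value, list) else value)
        for name, value in tag.attrs.items()
    )
    return "<{}{}>".format(tag.name, attrs)

print([opening_tag(tag) for tag in soup.find_all()])
# ['<div>', '<div>', '<div class="magical">', '<p>']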
Upvotes: 72
Reputation: 665
Here is an efficient function that I use to parse different HTML and text documents:
from bs4 import BeautifulSoup
from tqdm import tqdm  # progress bar over the file list

def parse_docs(path, format, tags):
    """
    Parse the files in path, having html or txt format, and extract the text content.
    Returns a list of strings, where every string is one document's text content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """
    docs = []
    if format == "html":
        # get_list_of_files is a helper (defined elsewhere) that returns the file paths under path
        for document in tqdm(get_list_of_files(path)):
            soup = BeautifulSoup(open(document, encoding='utf-8').read(), 'html.parser')
            # join the text of every tag listed in `tags`, e.g. <p> and <div>
            text = '\n'.join([''.join(s.find_all(text=True)) for s in soup.find_all(tags)])
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            text = open(document, encoding='utf-8').read()
            docs.append(text)
    return docs
A simple call like parse_docs('/path/to/folder', 'html', ['p', 'h', 'div'])
will return a list of text strings (note that 'h' only matches a literal <h> tag; for headings, pass 'h1', 'h2', and so on).
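The function above relies on a get_list_of_files() helper that the answer doesn't show. A minimal sketch of what it might look like, purely an assumption, using glob:
import glob
import os

def get_list_of_files(path):
    # Hypothetical stand-in: return all regular files directly under path.
    return [f for f in glob.glob(os.path.join(path, '*')) if os.path.isfile(f)]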
Upvotes: 0