Reputation: 1865
I am writing a Python script using Beautiful Soup, where I have to get the opening tag from a string containing some HTML code.
Here is my string:
string = "<p>...</p>"
I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but can't seem to find a solution. Can anyone advise me on this?
Upvotes: 0
Views: 3908
Reputation: 1
Using BeautifulSoup:
from bs4 import BeautifulSoup, Tag
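def get_tags(bs4_element: Tag):
    try:
        # Split the rendered element around its inner HTML.
        # ValueError is raised if the element has no children (empty separator)
        # or if the inner HTML occurs more than once in the rendered string.
        opening_tag, closing_tag = str(bs4_element).split(
            ''.join(str(child) for child in bs4_element.children)
        )
        return opening_tag, closing_tag
    except ValueError:
        print('Cannot parse children correctly')
        return None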
The function can be used, for example, like this (text is a string holding your HTML):
soup = BeautifulSoup(text, 'html.parser')
for element in soup.find_all():
    print(get_tags(element))
Old answer:
One simple approach that will only work for childless elements:
opening_tag, closing_tag = str(element).split(element.text)
Upvotes: 0
Reputation: 83
As far as I know, there is no built-in method in the BeautifulSoup API that returns the opening tag as it is, but we can write a small function for that.
from bs4 import BeautifulSoup
from bs4.element import Tag

# here's your function
def get_opening_tag(element: Tag) -> str:
    """Return the opening tag of the given element."""
    # Multi-valued attributes such as class are stored as lists; join them with spaces.
    raw_attrs = {k: v if not isinstance(v, list) else ' '.join(v) for k, v in element.attrs.items()}
    attrs = ' '.join(f'{k}="{v}"' for k, v in raw_attrs.items())
    # Avoid a stray space when there are no attributes ("<p>" rather than "<p >").
    return f"<{element.name} {attrs}>" if attrs else f"<{element.name}>"

def test():
    markup = """
<html>
<body>
<div id="root" class="class--name">
...
</div>
</body>
</html>
"""
    # if you're interested in the div tag
    element = BeautifulSoup(markup, 'lxml').select_one("#root")
    print(get_opening_tag(element))

if __name__ == '__main__':
    test()
Upvotes: 1
Reputation: 76
There is a way to do this with BeautifulSoup and a simple regex:
Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Because the contents are moved, they are not deleted.)
Then soupParagraph will hold just the opening and closing tags.
Convert soupParagraph to HTML text format and store that in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from the string variable.
In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.
A closing tag is simple. It does not have attributes defined for it, and a comment is not allowed within it.
Related questions: "Can I have attributes on closing tags?" and "HTML Comments inside Opening Tag of the Element".
This code gets the opening tag from a <body...> ... </body> section. The code has been tested.
import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so the next element to move is always at index 0
    bodyInnerHtml.append(bodyContentsList[0])
# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*<\/body\s*>\s*$)\Z"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")
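A variation on the steps above that works for any tag and skips the regular expression might look like the following. This is only a sketch, not part of the tested code: the helper name split_tags and the sample markup are made up for illustration. It empties a copy of the element rather than moving its contents, so the original tree is left untouched, and it assumes the parser renders an empty element as an opening tag immediately followed by its closing tag.

```python
import copy

from bs4 import BeautifulSoup


def split_tags(tag):
    """Return (opening_tag, closing_tag) for a bs4 Tag (hypothetical helper).

    closing_tag is None when the element renders without a separate
    closing tag, as void elements do.
    """
    shell = copy.copy(tag)   # work on a copy so the original tree is not modified
    shell.clear()            # drop all children, leaving only the tag itself
    rendered = str(shell)
    closing = f"</{tag.name}>"
    if rendered.endswith(closing):
        return rendered[:-len(closing)], closing
    return rendered, None


# Sample markup chosen for illustration only
soup = BeautifulSoup('<p class="a b">Hello <b>x</b></p>', 'html.parser')
opening_tag, closing_tag = split_tags(soup.p)
print(opening_tag)
print(closing_tag)
```

Because the tag is re-serialized by BeautifulSoup, attribute quoting and spacing may differ from the original source text.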
Upvotes: 1
Reputation: 473863
There is no direct way to get the opening and closing parts of a tag in BeautifulSoup, but you can at least get its name:
>>> from bs4 import BeautifulSoup
>>>
>>> html_content = """
... <body>
... <p>test</p>
... </body>
... """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p
With html.parser, though, you can listen to "start" and "end" tag "events".
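As a rough sketch of that idea, assuming "html.parser" here means the standard library module, you can subclass HTMLParser and record the tags as they are encountered. The class name TagRecorder and the sample markup are made up for illustration. get_starttag_text() returns the source text of the most recently opened start tag; end tags arrive only as a lowercased name, so the closing tag is rebuilt from it.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Collects opening and closing tags in document order (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # Raw text of the start tag exactly as it appeared in the input
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the tag name is passed in, so rebuild the closing tag from it
        self.closing_tags.append(f"</{tag}>")


recorder = TagRecorder()
recorder.feed('<p class="intro">...</p>')  # sample markup, for illustration only
recorder.close()

opening_tag = recorder.opening_tags[0]
closing_tag = recorder.closing_tags[0]
print(opening_tag)
print(closing_tag)
```

Unlike the BeautifulSoup-based approaches, this keeps the opening tag's original attribute order and quoting, since it is taken straight from the input.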
Upvotes: 2