Reputation: 1865

How to get the opening and closing tag in beautiful soup from HTML string?

I am writing a Python script using Beautiful Soup, where I have to get the opening tag from a string containing some HTML code.

Here is my string:

string = "<p>...</p>"

I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but can't find a solution. Can anyone advise me on this?

Upvotes: 0

Answers (4)

ravle

Reputation: 1

Using BeautifulSoup:

from bs4 import BeautifulSoup, Tag

def get_tags(bs4_element: Tag):
    try:
        opening_tag, closing_tag = str(bs4_element).split(
            ''.join(str(child) for child in bs4_element.children)
        )
        return opening_tag, closing_tag
    except ValueError:
        print('Cannot parse children correctly')
        return None

The function can be used for example in:

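# text is the string holding your HTML; naming the parser explicitly avoids the bs4 "no parser specified" warning
soup = BeautifulSoup(text, 'html.parser')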

for element in soup.find_all():
    print(get_tags(element))

Old answer:

One simple approach that only works for elements with no child tags (and with non-empty text, since splitting on an empty string raises ValueError):

opening_tag, closing_tag = str(element).split(element.text)

Upvotes: 0

Adnan MARSO

Reputation: 83

As far as I know, there is no built-in method in the BeautifulSoup API that returns the opening tag as is, but we can write a small function for that.

from bs4 import BeautifulSoup
from bs4.element import Tag


# here's your function
def get_opening_tag(element: Tag) -> str:
    """returns the opening tag of the given element"""
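    # multi-valued attributes such as class are stored as lists; join them back into one string
    raw_attrs = {k: ' '.join(v) if isinstance(v, list) else v for k, v in element.attrs.items()}
    # prefix each attribute with a space so a tag without attributes renders as "<p>", not "<p >"
    attrs = ''.join(f' {k}="{v}"' for k, v in raw_attrs.items())
    return f"<{element.name}{attrs}>"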


def test():

    markup = """
    <html>
        <body>
            <div id="root" class="class--name">
                ...
            </div>
        </body>
    </html>
    """

    # if you're interested in the div tag
    element = BeautifulSoup(markup, 'lxml').select_one("#root")

    print(get_opening_tag(element))


if __name__ == '__main__':
    test()

Upvotes: 1

JimYuill

Reputation: 76

There is a way to do this with BeautifulSoup and a simple regular expression:

Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
For the contents between the opening (<p>) and closing (</p>) tags, move the contents to another BeautifulSoup object, e.g., soupInnerParagraph. (By moving the contents, they are not deleted).
Then, soupParagraph will just have the opening and closing tags.
Convert soupParagraph to HTML text format and store that in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from the string variable.

In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.

A closing tag is simple. It does not have attributes defined for it, and a comment is not allowed within it.

Can I have attributes on closing tags?

HTML Comments inside Opening Tag of the Element

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml
    bodyInnerHtml.append(bodyContentsList[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*<\/body\s*>\s*$)\Z"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, 0, re.M)
if (substitutionCount != 1):
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")
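The same idea (empty the element, then render what is left) can also be sketched without a regular expression. This is only an illustration, not part of the tested code above: the helper name split_tags and the sample markup are made up for the example, and it works on a copy so the original tree is left intact. Note that the result is BeautifulSoup's own serialization of the tag, which may differ from the original source text in quoting or attribute formatting.

```python
import copy

from bs4 import BeautifulSoup


def split_tags(element):
    """Return (opening_tag, closing_tag) for a bs4 Tag.

    closing_tag is None when the element renders without one (e.g. <br/>).
    split_tags is a hypothetical helper name used only for this sketch.
    """
    clone = copy.copy(element)   # copy, so the original tree is not modified
    clone.clear()                # remove all children, leaving only the tag itself
    rendered = str(clone)        # opening tag immediately followed by closing tag
    closing = f"</{clone.name}>"
    if rendered.endswith(closing):
        return rendered[: -len(closing)], closing
    return rendered, None


# Sample markup, assumed for illustration
soup = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
opening_tag, closing_tag = split_tags(soup.p)
print(opening_tag)
print(closing_tag)
```

Because the closing tag is derived from the element name rather than matched with a pattern, nested elements with the same name inside the original content cannot confuse it.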

Upvotes: 1

alecxe

Reputation: 473863

There is no direct way to get opening and closing parts of the tag in BeautifulSoup, but, at least, you can get the name of it:

>>> from bs4 import BeautifulSoup
>>> 
>>> html_content = """
... <body>
...     <p>test</p>
... </body>
...  """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p

With the standard library's html.parser module, though, you can listen to "start" and "end" tag "events".
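A minimal sketch of that event-based approach, using only the standard library. The class name TagRecorder and the sample markup are assumptions made for this example. get_starttag_text() returns the start tag as it was written in the source; HTMLParser offers no equivalent for end tags, so the closing tag here is rebuilt from the tag name, which the parser lowercases.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Collects start tags (raw text) and end tags (rebuilt from the name).

    TagRecorder is a hypothetical name used only for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # Raw text of the start tag currently being handled
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # No raw end-tag text is available, so rebuild it from the name
        self.closing_tags.append(f"</{tag}>")


recorder = TagRecorder()
recorder.feed('<p class="intro">test</p>')  # sample markup, assumed for illustration
recorder.close()

opening_tag = recorder.opening_tags[0]
closing_tag = recorder.closing_tags[0]
print(opening_tag)
print(closing_tag)
```

For a document with many elements, the two lists are filled in document order, so you would need to pair start and end tags yourself (for example with a stack) if nesting matters.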

Upvotes: 2
