Shailesh Appukuttan

Reputation: 800

BeautifulSoup not handling HTML table inside anchor tag

Consider the sample HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Testing</title>
</head>
<body>
    <a href="https://www.google.com">
        <table>
            <tr>
                <td>Hello</td>
            </tr>
        </table>
    </a>
</body>
</html>

Parsing this with BeautifulSoup via html_soup = BeautifulSoup(html_source_code, "lxml") gives:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Testing</title>
</head>
<body>
    <a href="https://www.google.com">
    </a>
    <table>
        <tr>
            <td>Hello</td>
        </tr>
    </table>
</body>
</html>

Note how the table is no longer contained within the anchor tag, which changes the rendered output.

I have run the source code through online validators (e.g. https://validator.w3.org/) and they return no errors or warnings, so I believe there is nothing wrong with the HTML itself.

Why does BeautifulSoup do this, and how can I fix it? P.S. In my real use case it is not trivial to move the anchor tags to inner elements, owing to pre-defined CSS and JS features.

Upvotes: 1

Views: 31

Answers (1)

Rakesh

Reputation: 82765

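Use "html.parser" instead of "lxml". The lxml HTML parser appears to apply older, HTML 4-style nesting rules, under which a block-level element such as <table> is not allowed inside <a>, so it closes the anchor early and moves the table out. Python's built-in html.parser keeps the tree as written.

Example: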

from bs4 import BeautifulSoup

html_source_code = """<!DOCTYPE html>
<html lang="en">
<head>
    <title>Testing</title>
</head>
<body>
    <a href="https://www.google.com">
        <table>
            <tr>
                <td>Hello</td>
            </tr>
        </table>
    </a>
</body>
</html>"""

# Parse with Python's built-in parser instead of lxml
html_soup = BeautifulSoup(html_source_code, "html.parser")
print(html_soup.prettify(formatter="html"))

Output:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Testing
  </title>
 </head>
 <body>
  <a href="https://www.google.com">
   <table>
    <tr>
     <td>
      Hello
     </td>
    </tr>
   </table>
  </a>
 </body>
</html>
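
If you would rather check the nesting programmatically than by eye, something like the following may help. It is only a minimal sketch: the helper name table_is_inside_anchor and the trimmed SAMPLE_HTML are illustrative assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup from the question (illustrative only)
SAMPLE_HTML = """<a href="https://www.google.com">
    <table>
        <tr>
            <td>Hello</td>
        </tr>
    </table>
</a>"""


def table_is_inside_anchor(html, parser="html.parser"):
    """Return True if the first <table> still has an <a> ancestor after parsing.

    Hypothetical helper for illustration; `parser` is any parser name
    that BeautifulSoup accepts and that is installed locally.
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table")
    # find_parent("a") walks up the tree and returns None if no <a> encloses the table
    return table is not None and table.find_parent("a") is not None


print(table_is_inside_anchor(SAMPLE_HTML))
```

Passing parser="lxml" to the same helper (assuming lxml is installed) should let you compare how the two parsers treat the nesting.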

Upvotes: 2
