Reputation: 800
Consider the sample HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Testing</title>
</head>
<body>
<a href="https://www.google.com">
<table>
<tr>
<td>Hello</td>
</tr>
</table>
</a>
</body>
</html>
When I parse this with BeautifulSoup via:
html_soup = BeautifulSoup(html_source_code,"lxml")
I get:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Testing</title>
</head>
<body>
<a href="https://www.google.com">
</a>
<table>
<tr>
<td>Hello</td>
</tr>
</table>
</body>
</html>
Note how the table is no longer contained within the anchor tag, thereby altering the output.
I have run the source code through online validators (e.g. https://validator.w3.org/) and they return no errors or warnings, so I believe there is nothing wrong with the HTML itself.
Why does BeautifulSoup do this, and how can I fix it? P.S. In my real use case it is not trivial to move the anchor tags onto inner elements, owing to predefined CSS and JS features.
Upvotes: 1
Views: 31
Reputation: 82765
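Use "html.parser" instead of "lxml". As far as I know, lxml's HTML parser (built on libxml2) repairs markup according to older HTML 4 rules, under which an inline <a> may not contain a block-level <table>, so it closes the anchor before the table. HTML5 allows this nesting, which is why the validator reports no errors. Python's built-in html.parser does not try to restructure the markup and keeps the nesting as written.
Example: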
from bs4 import BeautifulSoup
html_source_code = """<!DOCTYPE html>
<html lang="en">
<head>
<title>Testing</title>
</head>
<body>
<a href="https://www.google.com">
<table>
<tr>
<td>Hello</td>
</tr>
</table>
</a>
</body>
</html>"""
html_soup = BeautifulSoup(html_source_code,"html.parser")
print(html_soup.prettify(formatter='html'))
Output:
<!DOCTYPE html>
<html lang="en">
<head>
<title>
Testing
</title>
</head>
<body>
<a href="https://www.google.com">
<table>
<tr>
<td>
Hello
</td>
</tr>
</table>
</a>
</body>
</html>
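If you would rather check the nesting programmatically than compare the prettified output by eye, something like the following may help. It is only a rough sketch: the helper name table_inside_anchor is made up for illustration, and it assumes bs4 is installed (lxml is optional and is skipped if missing). I have deliberately not written an expected result for "lxml", since that can depend on the installed version.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML = """<!DOCTYPE html>
<html lang="en">
<head>
<title>Testing</title>
</head>
<body>
<a href="https://www.google.com">
<table>
<tr>
<td>Hello</td>
</tr>
</table>
</a>
</body>
</html>"""


def table_inside_anchor(markup, parser):
    """Hypothetical helper: True if the first <table> still has an <a> ancestor."""
    soup = BeautifulSoup(markup, parser)
    table = soup.find("table")
    return table is not None and table.find_parent("a") is not None


# "html.parser" ships with Python; "lxml" is a separate third-party install.
for parser in ("html.parser", "lxml"):
    try:
        print(parser, table_inside_anchor(HTML, parser))
    except FeatureNotFound:
        print(parser, "is not installed, skipping")
```

With "html.parser" the check should report that the table is still inside the anchor, matching the output above. Running it against your real pages is a quick way to confirm that the parser you pick preserves the structure your CSS and JS rely on.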
Upvotes: 2