Reputation: 5344
For example, suppose you have HTML like this:
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
python:
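from bs4 import BeautifulSoup as bs
import urllib3
# Placeholder: replace with the URL of the page to parse
URL = 'html file'
http = urllib3.PoolManager()
# Fetch the page; the raw bytes are in page.data
page = http.request('GET', URL)
# Parse with Python's built-in HTML parser
soup = bs(page.data, 'html.parser')
print(soup.prettify())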
If you parse it with BeautifulSoup in Python and print it with prettify(), the output looks like this:
output:
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</meta>
</meta>
</meta>
</meta>
</meta>
</head>
However, if the meta tag is written in self-closing form, like
<meta name="description" content="Free Web tutorials" />
the output is left as it is, and no closing tag is added.
So how can I stop BeautifulSoup from adding unnecessary closing tags?
Upvotes: 2
Views: 405
Reputation: 5344
To solve this, you just need to change the parser from html.parser to lxml (lxml is a separate package, so it has to be installed first).
Your Python script then becomes:
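from bs4 import BeautifulSoup as bs
import urllib3
# Placeholder: replace with the URL of the page to parse
URL = 'html file'
http = urllib3.PoolManager()
# Fetch the page; the raw bytes are in page.data
page = http.request('GET', URL)
# Parse with lxml instead of html.parser
soup = bs(page.data, 'lxml')
print(soup.prettify())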
The only change is replacing soup = bs(page.data, 'html.parser')
with soup = bs(page.data, 'lxml')
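If you want to check how each parser behaves with your installed versions before changing a real script, here is a minimal sketch. It parses an inline string instead of fetching a URL and counts the literal </meta> strings in the prettified output. SAMPLE_HTML, prettify_with and count_closing_meta are hypothetical names made up for this illustration, and the result may differ between BeautifulSoup versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, shortened from the question
SAMPLE_HTML = """<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="author" content="John Doe">
</head>"""


def prettify_with(markup, parser):
    """Parse markup with the named parser and return the prettified text."""
    return BeautifulSoup(markup, parser).prettify()


def count_closing_meta(markup, parser):
    """Count literal '</meta>' strings in the prettified output."""
    return prettify_with(markup, parser).count("</meta>")


if __name__ == "__main__":
    for parser in ("html.parser", "lxml"):
        try:
            print(parser, count_closing_meta(SAMPLE_HTML, parser))
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is missing
            print(parser, "is not installed")
```

A count of 0 for a parser means it did not add any closing meta tags to this sample.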
Upvotes: 2