Nihal

BeautifulSoup parser adds unnecessary closing html tags

For example, suppose you have HTML like this:

<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

Python script:

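from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'  # placeholder for the address of the HTML page

http = urllib3.PoolManager()

# Fetch the page and parse the raw response bytes with the built-in parser
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')

print(soup.prettify())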

If you parse this with BeautifulSoup and print it with prettify(), each meta tag ends up nested inside the previous one and closing tags are added. Output:

<html>
<head>
  <meta charset="UTF-8">
    <meta name="description" content="Free Web tutorials">
        <meta name="keywords" content="HTML,CSS,XML,JavaScript">
            <meta name="author" content="John Doe">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                </meta>
             </meta>
         </meta>
     </meta>
  </meta>
</head>

However, if the meta tag is written in self-closing form, like

<meta name="description" content="Free Web tutorials" />

the output keeps it as it is and no closing tag is added.

So how can I stop BeautifulSoup from adding these unnecessary closing tags?

Answers (1)

Nihal

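To solve this, switch from the html.parser parser to the lxml parser. Note that lxml is a separate package and has to be installed (for example with pip install lxml).

Your Python script then becomes: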

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)
soup = bs(page.data, 'lxml')

print(soup.prettify())

The only change is the parser argument: replace soup = bs(page.data, 'html.parser') with soup = bs(page.data, 'lxml').
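To check how a given parser treats the meta tags on your own installation, a small offline comparison can help. The following is only a sketch: it uses an inline copy of the head from the question instead of a network request, and the helper name summarize and the keys of the dictionary it returns are made up for illustration. Results may differ between BeautifulSoup versions, so no particular output is claimed here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Inline copy of the <head> from the question, so no network request is needed.
HTML = """<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>"""


def summarize(parser_name):
    """Parse HTML with the given parser and report how <meta> tags were handled.

    Hypothetical helper for illustration. Returns None if the requested
    parser is not installed.
    """
    try:
        soup = BeautifulSoup(HTML, parser_name)
    except FeatureNotFound:
        return None
    metas = soup.find_all("meta")
    return {
        "meta_count": len(metas),
        # A <meta> containing another <meta> was treated as an open container.
        "nested": any(m.find("meta") is not None for m in metas),
        # True if prettify() emitted an explicit closing tag.
        "has_closing_tag": "</meta>" in soup.prettify(),
    }


for name in ("html.parser", "lxml"):
    print(name, summarize(name))
```

If "nested" or "has_closing_tag" comes back True for one parser and False for the other, that confirms the parser choice is what causes the extra closing tags.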
