Kim Hyesung

Reputation: 797

Python: How to remove empty lines in HTML

I have a problem. I removed some tags from an HTML file, but I want the output to contain no empty lines. The current output looks like this:

<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>

</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">

</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">

</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>

My expected output is:

<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>

How can I remove the empty lines from the HTML? Can I use BeautifulSoup, or some other library?

UPDATE

I tried to combine my code with @elethan's answer, but I got an error.

My code is:

from list import get_filepaths
from bs4 import BeautifulSoup
from bs4 import Comment


filenames = get_filepaths(r"C:\Coba")
index = 0
for f in filenames:
    file_html=open(str(f),"r")
    soup = BeautifulSoup(file_html,"html.parser")
    [x.extract() for x in soup.find_all('script')]
    [x.extract() for x in soup.find_all('style')]
    [x.extract() for x in soup.find_all('meta')]
    [x.extract() for x in soup.find_all('noscript')]
    [x.extract() for x in soup.find_all(text=lambda text:isinstance(text, Comment))]

    index += 1
    stored_file = "PreProcessing\extracts" + '{0:03}'.format(index) + ".html"
    filewrite = open(stored_file, "w")
    filewrite.write(str(soup) + '\n')
    with open(stored_file, 'r+') as f:
        lines = [i for i in f.readlines() if i and i != '\n']
        f.seek(0)
        f.writelines(lines)
        f.truncate()
    filewrite.close

But I get output like the one in the screenshot (sorry, I can't paste the HTML). The beginning looks fine, but near the end of the file there is a run of NUL characters.

How can I remove the NUL characters?

Upvotes: 0

Views: 1599

Answers (2)

宏杰李

Reputation: 12168

Yes, you can use BeautifulSoup, and it is very simple.

BS4 tries to repair broken HTML, for example by adding the missing closing tags (see the last line of the output, </body></html>), and it removes the extra whitespace. Results differ slightly between parsers; the 'lxml' parser works well here.

from bs4 import BeautifulSoup

# html_doc holds the HTML source from the question as a string
soup = BeautifulSoup(html_doc, 'lxml')
print(str(soup))

Output:

<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
</body></html>
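
Because the result depends on the parser, a parser-independent fallback may be useful when blank lines survive parsing. The sketch below post-processes the serialized string with the standard re module. The function name collapse_blank_lines and the sample input are illustrative assumptions, not part of the answer above.

```python
import re


def collapse_blank_lines(text):
    """Collapse runs of empty or whitespace-only lines into one newline."""
    # "\n\s*\n" matches a newline, any whitespace-only lines after it,
    # and the newline that ends the last of them. Indentation at the
    # start of the next non-blank line is left untouched.
    return re.sub(r"\n\s*\n", "\n", text)


# Hypothetical stand-in for str(soup).
sample = "<head>\n<title>t</title>\n\n</head>\n  \n\n<body>\n</body>\n"
print(collapse_blank_lines(sample), end="")
```

Note that a blank line at the very start of the text is only reduced to a single newline, since the pattern needs a preceding newline to match.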

Upvotes: -1

elethan

Reputation: 17003

In your code, first read the file and keep only the lines that are not empty:

with open(my_html_file) as f:
    lines = [i for i in f.readlines() if i and i != '\n']

Then write the filtered text back to the file:

with open(my_html_file, 'w') as f:
    f.writelines(lines)

Or to do the whole thing in a single with block:

with open(my_html_file, 'r+') as f:
    lines = [i for i in f.readlines() if i and i != '\n']
    f.seek(0)
    f.writelines(lines)
    f.truncate()

Depending on your existing code (which you should add to your question), you might be able to simply add the filtering part of my code to what you have.
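
As a rough sketch of how the filtering could be folded into the loop from the question's update: filter the serialized markup in memory and write the file once, so the output file is never open through two handles at the same time. The helper name write_without_blank_lines, the temporary path, and the sample string are illustrative assumptions, not part of the original code. Unlike the i != '\n' check above, this version also drops lines that contain only spaces or tabs.

```python
import os
import tempfile


def write_without_blank_lines(path, text):
    """Write text to path, skipping empty and whitespace-only lines."""
    # The with block flushes and closes the file when it ends.
    with open(path, "w") as out:
        for line in text.split("\n"):
            if line.strip():
                out.write(line + "\n")


# Hypothetical stand-in for str(soup) from the question's loop.
markup = "<head>\n\n<title>t</title>\n</head>\n   \n<body>\n</body>\n"
path = os.path.join(tempfile.mkdtemp(), "extracts001.html")
write_without_blank_lines(path, markup)

# Read the file back to confirm the blank lines are gone.
with open(path) as check:
    result = check.read()
print(result, end="")
```

In the question's loop, the call would be write_without_blank_lines(stored_file, str(soup)). The NUL characters described in the update are likely caused by reopening the file in 'r+' mode while filewrite still holds unflushed data: filewrite.close without parentheses never calls the method, so the first handle stays open and writes its remaining buffer past the truncated end of the file.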

Upvotes: 2
