Reputation: 797
I have a problem. I removed some tags from an HTML file, but I want the output to contain no empty lines. The current output looks like this:
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
My expected output is
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
How can I remove the empty lines from the HTML? Can I use BeautifulSoup, or some other library?
UPDATE
I tried to combine my code with @elethan's answer, but I got an error.
My code is:
from list import get_filepaths
from bs4 import BeautifulSoup
from bs4 import Comment
filenames = get_filepaths(r"C:\Coba")
index = 0
for f in filenames:
    file_html=open(str(f),"r")
    soup = BeautifulSoup(file_html,"html.parser")
    [x.extract() for x in soup.find_all('script')]
    [x.extract() for x in soup.find_all('style')]
    [x.extract() for x in soup.find_all('meta')]
    [x.extract() for x in soup.find_all('noscript')]
    [x.extract() for x in soup.find_all(text=lambda text:isinstance(text, Comment))]
    index += 1
    stored_file = "PreProcessing\extracts" + '{0:03}'.format(index) + ".html"
    filewrite = open(stored_file, "w")
    filewrite.write(str(soup) + '\n')
    with open(stored_file, 'r+') as f:
        lines = [i for i in f.readlines() if i and i != '\n']
        f.seek(0)
        f.writelines(lines)
        f.truncate()
    filewrite.close
But the output I get looks like this (sorry, I can't paste the HTML): it is fine at the beginning, but near the end there are NUL characters (as in the screenshot).
How can I remove the NUL characters?
Upvotes: 0
Views: 1599
Reputation: 12168
Yes, you can use BeautifulSoup, and it's very simple.
BS4 will try to fix broken HTML, for example by adding the missing closing tags (see the last line of the output, </body></html>), and remove the white space. The results differ slightly from parser to parser; the 'lxml' parser performs well.
from bs4 import BeautifulSoup
# html_doc is assumed to hold the HTML from the question as a string
soup = BeautifulSoup(html_doc, 'lxml')
print(str(soup))
out:
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
</body></html>
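If the serialized output still contains blank lines (for example where extracted tags used to be), they can be filtered out of the string before it is printed or written. The following is only a minimal sketch using the standard library; `drop_blank_lines` and `sample` are hypothetical names, and in practice the input would be something like `str(soup)`.

```python
def drop_blank_lines(text):
    """Return text without empty or whitespace-only lines."""
    # splitlines() handles both "\n" and "\r\n" line endings.
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n"


# Stand-in for str(soup); the real input would come from BeautifulSoup.
sample = "<head>\n\n<title>Example</title>\n   \n</head>\n"
print(drop_blank_lines(sample), end="")
```

Using `line.strip()` as the test also drops lines that contain only spaces or tabs, which a plain comparison with '\n' would keep.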
Upvotes: -1
Reputation: 17003
In your code, first remove all the extra newlines from the file:
with open(my_html_file) as f:
    lines = [i for i in f.readlines() if i and i != '\n']
Then write the filtered text back to the file:
with open(my_html_file, 'w') as f:
    f.writelines(lines)
Or, to do the whole thing in a single with block:
with open(my_html_file, 'r+') as f:
    lines = [i for i in f.readlines() if i and i != '\n']
    f.seek(0)
    f.writelines(lines)
    f.truncate()
Depending on your existing code (which you should add to your question), you might be able to simply add the filtering part of my code to what you have.
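Regarding the NUL characters in the question's update: a likely cause is that the output file is still open for writing when it is reopened in 'r+' mode (`filewrite.close` without parentheses never calls the method), so buffered data gets flushed after the truncate and leaves a gap of NUL bytes. One way to avoid this, sketched below under that assumption, is to filter the lines in memory and write the file only once. The names `write_clean_html` and `demo_path` are hypothetical; in the question's loop, the text would be `str(soup)`.

```python
import os
import tempfile


def write_clean_html(path, html_text):
    """Write html_text to path once, skipping blank lines."""
    lines = [line for line in html_text.splitlines(keepends=True)
             if line.strip()]
    # The with block flushes and closes the file on exit, so no second
    # handle is ever open on the same path.
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(lines)


# Demonstration with a temporary file standing in for the real output path.
demo_path = os.path.join(tempfile.mkdtemp(), "extracts001.html")
write_clean_html(demo_path, "<body>\n\n<div></div>\n\n</body>\n")
with open(demo_path, encoding="utf-8") as f:
    print(f.read(), end="")
```

Because the file is written a single time, there is no need for the seek and truncate steps.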
Upvotes: 2