Reputation: 1413
I am trying to convert an HTML page to text and store it in a file. The conversion works, but the file contains stray backslashes and asterisks.
Here's the code that I am using:
import html2text
from bs4 import BeautifulSoup
import requests as r
url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html")
# print(html2text.html2text(url.text))
web_text = url.text
file = open('text', 'w+')
file.write(html2text.html2text(web_text.replace("** \----", "")))
file.close()
Here's the output that I get:
HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018
FROM: JONNY HAMMOND / AFFINITY TANKERS
HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018
===========================================================================
DATE VESSEL DWT YR PORT OPEN FLEET COMMENT
\---- \------ \--- -- ---- \---- \----- \-------
23/10 **KRISJANIS VALDEMA 37 07 MALTA 23/10 LATVIAN SUBS**
Expected format:
HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018
FROM: JONNY HAMMOND / AFFINITY TANKERS
HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018
===========================================================================
DATE VESSEL DWT YR PORT OPEN FLEET COMMENT
---- ------ --- -- ---- ---- ----- -------
23/10 KRISJANIS VALDEMA 37 07 MALTA 23/10 LATVIAN SUBS
Upvotes: 0
Views: 2386
Reputation: 8270
You can remove the unwanted symbols with str.replace:
from html2text import html2text
import requests as r
html = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").text
# Strip the bold markers ('*') and the escaped hyphens (backslash + '-')
text = html2text(html).replace('*', '').replace('\\-', '')
with open('text.txt', 'w') as f:
    f.write(text)
Output would be:
HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018
FROM: JONNY HAMMOND / AFFINITY TANKERS
HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018
===========================================================================
DATE VESSEL DWT YR PORT OPEN FLEET COMMENT
--- ----- -- -- ---- --- ---- ------
23/10 KRISJANIS VALDEMA 37 07 MALTA 23/10 LATVIAN SUBS
25/10 SEAVALOUR 47 07 GREECE 23/10 THENAMARIS SUBS
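Note that removing `\-` outright also drops one hyphen from each escaped dash group, which is why the separator row above is shorter than the one in the expected format. A minimal sketch of a variant that keeps the hyphen and only removes the backslash is below. The function name `clean_html2text_output` and the `sample` string are illustrative; the sample is copied from the output shown in the question, not fetched from the page.

```python
def clean_html2text_output(text: str) -> str:
    """Undo the Markdown markup seen in the question's output."""
    # Turn an escaped hyphen (backslash + "-") back into a plain hyphen,
    # so every dash group keeps its original width.
    text = text.replace("\\-", "-")
    # Drop the "**" bold markers but keep the text between them.
    text = text.replace("**", "")
    return text


# Two lines taken from the question's output (hard-coded for illustration).
sample = (
    "\\---- \\------ \\--- -- ---- \\---- \\----- \\-------\n"
    "23/10 **KRISJANIS VALDEMA 37 07 MALTA 23/10 LATVIAN SUBS**"
)
print(clean_html2text_output(sample))
```

In the answer's code this would be applied to the result of `html2text(html)` before writing the file. It still removes every `**` in the text, so it assumes the page contains no literal double asterisks worth keeping.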
Upvotes: 1
Reputation: 1184
If you don't need BeautifulSoup, you can use the html2text library for rendering. In my opinion, it is more reliable for converting HTML to text.
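import html2text
from urllib.request import urlopen
# open() only reads local files, so fetch the page with urlopen (assuming it is UTF-8 encoded)
htmlForRender = urlopen("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").read().decode("utf-8")
print(html2text.html2text(htmlForRender))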
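Edit: the same fix using the requests library:
import requests as r
url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html")
# html2text expects a string, so pass the response body, not the Response object
print(html2text.html2text(url.text))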
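If the Markdown markup is not wanted at all, one alternative is to skip Markdown conversion and collect only the text nodes. The sketch below swaps in the standard library's `html.parser` for html2text, so it is a different technique from the one this answer describes. `TextExtractor`, `html_to_plain_text`, and `sample_html` are hypothetical names and data made up for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Hypothetical snippet in the spirit of the page from the question.
sample_html = "<pre>---- ------\n23/10 <b>KRISJANIS VALDEMA</b> 37</pre>"
print(html_to_plain_text(sample_html))
```

Because nothing is converted to Markdown, no `**` or `\-` is introduced. The trade-off is that this simple version also keeps the contents of `<script>` and `<style>` elements and does nothing to lay out tables or lists.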
Upvotes: 0