Sahil

Reputation: 1413

Converting HTML to TXT

I am trying to convert an HTML page to text and store it in a file. The conversion works, but the output file contains stray backslashes and asterisks.

Here's the code that I am using:

import html2text 
from bs4 import BeautifulSoup
import requests as r 


url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html")

# print(html2text.html2text(url.text))
web_text = url.text
file = open('text', 'w+')
file.write(html2text.html2text(web_text.replace("** \----", "")))
file.close()

Here's the output that I get:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT  

\----  \------           \--- -- ----           \----  \-----       \-------  

23/10 **KRISJANIS VALDEMA 37 07 MALTA           23/10 LATVIAN     SUBS**  

Expected format:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT       

----  ------           --- -- ----           ----  -----       -------       

23/10 KRISJANIS VALDEMA 37 07 MALTA          23/10 LATVIAN     SUBS  

Upvotes: 0

Views: 2386

Answers (2)

Alderven

Reputation: 8270

You can remove unnecessary symbols using replace:

from html2text import html2text
import requests as r

html = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").text
# html2text emits Markdown: remove the asterisks and turn each escaped "\-" back into "-"
text = html2text(html).replace('*', '').replace('\\-', '-')
with open('text.txt', 'w') as f:
    f.write(text)

Output would be:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT


----  ------           --- -- ----           ----  -----       -------

23/10 KRISJANIS VALDEMA 37 07 MALTA           23/10 LATVIAN     SUBS  



25/10 SEAVALOUR          47 07 GREECE         23/10 THENAMARIS  SUBS
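
The same cleanup can be wrapped in a small helper and checked without fetching the page. This is only a sketch: the function name `strip_markdown_artifacts` is made up for illustration, the sample rows are copied from the question's output, and it assumes the text contains no literal `**` or `\-` that should survive.

```python
def strip_markdown_artifacts(text: str) -> str:
    """Undo the Markdown escapes and bold markers seen in the html2text output."""
    # Turn an escaped hyphen (backslash + hyphen) back into a plain hyphen.
    text = text.replace("\\-", "-")
    # Drop bold markers (two asterisks); single asterisks are left alone.
    text = text.replace("**", "")
    return text


# Sample rows taken from the output shown in the question.
sample = (
    "\\----  \\------           \\--- -- ----\n"
    "23/10 **KRISJANIS VALDEMA 37 07 MALTA           23/10 LATVIAN     SUBS**\n"
)
print(strip_markdown_artifacts(sample))
```

Replacing `\-` with `-` rather than with an empty string keeps every rule line at its original width, so the columns stay aligned with the header.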

Upvotes: 1

sayhan

Reputation: 1184

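If you do not need BeautifulSoup, you can use the html2text library on its own for rendering. In my opinion, it is more reliable for converting HTML to text.

import html2text
from urllib.request import urlopen

# The built-in open() only reads local files, so fetch the URL with urlopen instead
htmlForRender = urlopen("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").read().decode("utf-8", errors="replace")

print(html2text.html2text(htmlForRender))

Edit: the same code using the requests library:

import html2text
import requests as r

url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html")

# html2text expects a string, so pass the response body rather than the Response object
print(html2text.html2text(url.text))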
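
If the Markdown formatting is not wanted at all, another option is to skip html2text and collect only the text nodes of the page. The following is a minimal sketch using the standard library's `html.parser`, not something either answer proposes. The names `TextExtractor` and `html_to_plain_text` are hypothetical, and the sketch assumes the page keeps its layout in preformatted text whose whitespace should be preserved unchanged.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep the text exactly as it appears; no Markdown escaping is applied.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Small inline sample; in practice pass the page body, e.g. r.get(...).text
sample_html = "<pre>----  ------\n23/10 <b>KRISJANIS VALDEMA</b> 37</pre>"
print(html_to_plain_text(sample_html))
```

Because no Markdown is generated, there are no `\-` escapes or `**` markers to strip afterwards. The trade-off is that links, headings and lists lose their formatting as well.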

Upvotes: 0
