Reputation: 3360
I have an HTML file called test.html that contains a single word: בדיקה.
I open test.html and print its content using this block of code:
file = open("test.html", "r")
print file.read()
but it prints ??????. Why does this happen, and how can I fix it?
By the way, when I open a plain text file it works fine.
Edit: I also tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
Upvotes: 50
Views: 270577
Reputation: 261
You can use the following code:
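from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
# open the file as UTF-8 so its contents are decoded to text
f = codecs.open("test.html", 'r', 'utf-8')
# name the parser explicitly; get_text() strips the markup
document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)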
If you also want to drop the blank lines and collect the words into a single string (skipping special characters and numbers), add:
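import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# assumption: "stop" was not defined in the original; an English stop-word set is used here
stop = set(stopwords.words("english"))
# st must start out as an empty string
st = ""
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # keep only tokens made of ASCII letters
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)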
Upvotes: 17
Reputation: 510
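You can simply use requests. Note that it fetches a page over HTTP, so url must be the page's address rather than a local file path:
import requests
response = requests.get(url)
print(response.text)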
Upvotes: 0
Reputation: 11
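Code:
import codecs
path="D:\\Users\\html\\abc.html"
# "rb" returns raw bytes; nothing is decoded at this point
file=codecs.open(path,"rb")
file1=file.read()
# decode explicitly (assuming the file is UTF-8); in Python 3, str() on bytes gives the b'...' repr
file1=file1.decode("utf-8")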
Upvotes: 1
Reputation: 351
I encountered this problem today as well. I am using Windows, and the default system language is Chinese, so others with a non-English system locale may hit the same Unicode error. Simply add encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
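A minimal round-trip sketch of the same idea, assuming Python 3 and a UTF-8 file. The temporary directory and the `read_html` helper are illustrative names, not part of the answer above.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the file's text, decoded with an explicit encoding."""
    # Passing encoding= avoids relying on the platform default,
    # which may be a legacy code page rather than UTF-8.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Write a small UTF-8 test file, then read it back.
sample_path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("<p>בדיקה</p>")

text = read_html(sample_path)

# ascii() prints escapes, so the output does not depend on
# what the console itself is able to display.
print(ascii(text))
```

Reading the file correctly and displaying it correctly are separate steps: even with the right `encoding`, a console that cannot render Hebrew may still show placeholder characters.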
Upvotes: 25
Reputation: 3918
You can use urllib in Python 3 in the same way as
https://stackoverflow.com/a/27243244/4815313, with a few changes.
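# Python 3
import urllib.request  # a bare "import urllib" is not guaranteed to load the request submodule
# urlopen expects a full URL including a scheme (for example http:// or file://)
page = urllib.request.urlopen("/path/").read()
print(page)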
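For a local file, one possible approach is to convert the path to a `file://` URL first. This is a sketch that assumes Python 3 and a UTF-8 file; the sample file is created only for the demonstration.

```python
import tempfile
import urllib.request
from pathlib import Path

# Create a small UTF-8 file to read back (illustrative only).
sample_path = Path(tempfile.mkdtemp()) / "test.html"
sample_path.write_text("<p>בדיקה</p>", encoding="utf-8")

# urlopen needs a full URL; as_uri() turns an absolute path into file://...
with urllib.request.urlopen(sample_path.as_uri()) as response:
    raw = response.read()  # bytes, not text

# Decode explicitly to get a str (assuming the file is UTF-8).
page = raw.decode("utf-8")
print(ascii(page))
```

Note that `read()` returns bytes here, so printing `page` without the `decode` step would show the raw byte representation instead of the text.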
Upvotes: -2
Reputation: 4058
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
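To illustrate why the encoding argument matters, here is a small sketch (Python 3, standard library only). It assumes the file is UTF-8, and the sample file exists only for the demonstration. The last step shows one way question marks can arise: forcing text into a codec that cannot represent it. Whether that is exactly what happened in the question depends on the console in use, so treat it as a plausible explanation rather than a diagnosis.

```python
import codecs
import os
import tempfile

# Create a UTF-8 sample file (illustrative only).
sample_path = os.path.join(tempfile.mkdtemp(), "test.html")
with codecs.open(sample_path, "w", "utf-8") as f:
    f.write("בדיקה")

# With an encoding, codecs.open decodes the bytes into text.
with codecs.open(sample_path, "r", "utf-8") as f:
    decoded = f.read()

# In binary mode, the raw bytes come back undecoded.
with open(sample_path, "rb") as f:
    raw = f.read()

# Each of the five Hebrew letters takes two bytes in UTF-8.
print(len(decoded), len(raw))  # 5 10

# Forcing the text into ASCII with errors="replace" yields question marks.
print(decoded.encode("ascii", errors="replace"))  # b'?????'
```

In Python 3 the built-in `open()` accepts the same `encoding` argument, so `codecs` is not strictly needed there.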
Upvotes: 6
Reputation: 67968
import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.
Upvotes: 62
Reputation: 2286
You can read an HTML page using urllib.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Upvotes: 8