Hachiko1337

Reputation: 23

Is there a way to retrieve the HTML content of a web page as a string in Python?

I am trying to retrieve the HTML content of a web page and read it as a string. However, whenever I run my code I get a bytes-like object instead of a string, and decode() does not seem to work in this case.

My code is the following:

import urllib.request

money_request = urllib.request.urlopen('website-url-here').read()

print(money_request.decode('utf-8'))

Running the above script will yield the following error:

Traceback (most recent call last):
  File "E:\University Stuff\Licenta\gas_station_service.py", line 12, in <module>
    print(money_request.decode())
  File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u02bb' in position 143288: character maps to <undefined>

I would also like to mention that I have checked that the website uses UTF-8 encoding, using the Chrome console and the command document.characterSet.

I need to retrieve this as a string in order to search the lines of HTML for a value inside a span tag.

Any help is appreciated.

Upvotes: 1

Views: 126

Answers (2)

Oli

Reputation: 26

You can simply use the response's text attribute to get the website's HTML as a string:

import requests
response = requests.get('website-url-here')
print(response.text)
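
Since you mentioned you want to search the HTML for a value inside a span tag, here is a rough sketch of how that could look with plain string searching. The span's class below is a made-up placeholder, so replace it with the one actually used on your page:

import requests

response = requests.get('website-url-here')
html = response.text  # requests decodes the body to a str for you

marker = '<span class="price">'  # hypothetical marker, adjust to your page
start = html.find(marker)
if start != -1:
    start += len(marker)
    end = html.find('</span>', start)
    print(html[start:end].strip())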

Upvotes: 0

Umutambyi Gad

Reputation: 4101

Maybe it would be better to use Beautiful Soup, because it helps parse the HTML. If you don't have the module, install it with pip install bs4 on Windows or pip3 install bs4 on Mac or Linux. requests is a third-party package as well, so install it if you don't already have it, and if you don't have the lxml module go ahead and install it with pip install lxml.

import requests
from bs4 import BeautifulSoup

res = requests.get('website-url-here')
src = res.content
soup = BeautifulSoup(src, 'lxml')
markup = soup.prettify()
print(markup)

and you'll get the entire page you are scraping. It may be easier to extract the useful parts by finding the contents that you want:

soup.find_all('div', {'class': 'classname'})

This will return a list of all the matches, while this one doesn't:

soup.find('div', {'class': 'classname'})

but it will return only the first match. The choice is yours.
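
And since the value you're after sits inside a span tag, something like this should pull it out once you know the tag's class (the class name here is just a placeholder):

span = soup.find('span', {'class': 'classname'})  # placeholder class name
if span is not None:
    print(span.text.strip())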

Upvotes: 1
