Reputation: 49
When running my code, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 71: ordinal not in range(128)
This is my whole code:
from urllib.request import urlopen as uReq
from urllib.request import urlretrieve as uRet
from bs4 import BeautifulSoup as soup
import urllib

for x in range(143, 608):
    myUrl = "example.com/" + str(x)
    try:
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        container = page_soup.findAll("div", {"id": "videoPostContent"})
        img_container = container[0].findAll("img")
        images = img_container[0].findAll("img")
        imgCounter = 0
        if len(images) == "":
            for image in images:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '.jpg')
        else:
            for image in img_container:
                print('Downloading image from ' + image['src'] + '...')
                imgCounter += 1
                uRet(image['src'], 'pictures/' + str(x) + '_' + str(imgCounter) + '.jpg')
    except urllib.error.HTTPError:
        continue
Tried solutions:
I tried adding .encode('utf-8')/.decode('utf-8') and .text.encode('utf-8')/.decode('utf-8') to page_soup, but that gives errors like:
AttributeError: 'str' object has no attribute 'findAll'
AttributeError: 'bytes' object has no attribute 'findAll'
Upvotes: 0
Views: 353
Reputation: 55670
At least one of the image src urls contains non-ascii characters, and urlretrieve is unable to process them:
>>> url = 'http://example.com/' + '\u0303'
>>> urlretrieve(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
UnicodeEncodeError: 'ascii' codec can't encode character '\u0303' in position 5: ordinal not in range(128)
You could try one of these approaches to get around this problem:

- Assume that these urls are valid, and retrieve them using a library that has better unicode handling, like requests (see the first sketch below).
- Assume that the urls are valid, but contain unicode characters that must be escaped before being passed to urlretrieve. This entails splitting the url into scheme, domain, path etc., quoting the path and any query parameters, and then unsplitting; all the tools for this are in the urllib.parse package (but this is probably what requests does anyway, so just use requests). A sketch follows below.
- Assume that these urls are broken and skip them by wrapping your urlretrieve calls with try/except UnicodeEncodeError (third sketch below).
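
A minimal sketch of the first approach, reusing the non-ascii url from the example above (the 'pictures/1.jpg' destination is just an illustrative filename):

import requests

url = 'http://example.com/' + '\u0303'

# requests percent-encodes the non-ascii characters in the url before
# sending the request, so no UnicodeEncodeError is raised.
resp = requests.get(url)
resp.raise_for_status()
with open('pictures/1.jpg', 'wb') as f:
    f.write(resp.content)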
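
A sketch of the second approach; quote_url is a hypothetical helper name, and this version only quotes the path and query (a domain containing non-ascii characters would additionally need IDNA encoding):

from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import urlretrieve

def quote_url(url):
    # Split the url into its parts, percent-encode the path and the
    # query, then reassemble it into a pure-ascii url.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path),
        quote(parts.query, safe='=&'),
        parts.fragment,
    ))

url = 'http://example.com/' + '\u0303'
urlretrieve(quote_url(url), 'pictures/1.jpg')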
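
And a sketch of the third approach; try_retrieve is a hypothetical wrapper that you could call inside the loop in place of uRet:

from urllib.request import urlretrieve

def try_retrieve(src, dest):
    # Download src to dest, skipping urls that urlretrieve cannot
    # encode instead of letting the whole loop crash.
    try:
        urlretrieve(src, dest)
        return True
    except UnicodeEncodeError:
        print('Skipping non-ascii url: ' + repr(src))
        return False

In your loop that would be, e.g., try_retrieve(image['src'], 'pictures/' + str(x) + '.jpg').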
Upvotes: 0