galdino

Reputation: 13

Web scraping trouble - Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

I tried to scrape a website with urllib and BeautifulSoup (Python 3.9), but I keep getting the warning "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER", and the output is full of special characters like the ones below:

��T�w?.��m����%�%z��%�H=S��$S�YYyi�ABD�x�!%��f36��\�Y�j�46f����I��9��!D��������������������b7�3�8��JnH�t���mړBm���<���,�zR�m��A�g��{�XF%��&)�6zy��' �)a�Fo �����N舅,���~?w�w� �7z�Y6N������Q��ƣA��,p�8��/��W��q�$ ���#e�J7�#� 5�X�z�Ȥ�&q��8 ��H"����I0�����͂8ZY}J�m��c}&5e��? "/>[�7X�?NF4r���[k��6�X?��VV��H�J$j�6h��e�C��]<�V��z D ����"d�nje��{���+YL��*�X?a���m�������MNn�+��1=b$�N�4p�0���/�h�'�?�,�[��V��$�D���Z��+�?�x�X�g����

I have read several topics about this problem, but I couldn't find a solution that works in my case. Here is my code:

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Accept": "*/*",
        "Accept-Encoding" : "gzip, deflate, br",
        "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
        "Connection" : "keep-alive"}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page, "html.parser", from_encoding="utf-8")
    #divs = soup.findAll('div')
    #href = [i['href'] for i in soup.findAll('a', href=True)]
    print(soup)

else:
    print("failed!")

I tried changing the encoding to ASCII or ISO-8859-(1...9), but the problem is still the same.

Thanks for your help :)

Upvotes: 1

Views: 950

Answers (1)

Andrej Kesely

Reputation: 195408

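Remove Accept-Encoding from the HTTP headers. With that header the server may send a compressed body, and urllib does not decompress it automatically, so BeautifulSoup ends up parsing compressed bytes instead of HTML: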

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Accept": "*/*",
    # "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page, "html.parser", from_encoding="utf-8")
    # divs = soup.findAll('div')
    # href = [i['href'] for i in soup.findAll('a', href=True)]
    print(soup)

else:
    print("failed!")

Prints:


<!DOCTYPE html>

<html class="no-js" lang="fr-FR">
<head><meta charset="utf-8"/> <!-- entry: inline-kameleoon -->


...
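
If you would rather keep requesting a compressed response, a possible alternative is to decompress the body yourself before handing it to BeautifulSoup. The following is only a minimal sketch under a few assumptions: `decode_body` is a hypothetical helper name (not part of urllib or bs4), the request advertises only "gzip, deflate" (Brotli, "br", is not in the standard library), and the page is UTF-8 as its meta tag suggests.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Hypothetical helper: undo the Content-Encoding, then decode to text."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            # "deflate" is normally zlib-wrapped data
            raw = zlib.decompress(raw)
        except zlib.error:
            # some servers send a raw deflate stream without the zlib header
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    elif encoding not in ("", "identity"):
        # e.g. "br": not handled in this sketch
        raise ValueError(f"unsupported Content-Encoding: {encoding!r}")
    return raw.decode(charset, errors="replace")
```

With urllib this would presumably be called as decode_body(page.read(), page.headers.get("Content-Encoding", "")), and the resulting string passed to BeautifulSoup without from_encoding. Decoding the still-compressed bytes as UTF-8 is what produces the replacement characters shown in the question.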

Upvotes: 3
