Mike Boulton
Reputation: 15

I am getting an error when trying to extract content using justext

I am trying to extract the main content from a URL using justext.

My code is as follows:

import requests
import justext

url = 'https://yoursoccerhome.com/what-is-a-cap-in-soccer-the-meaning-and-history-of-the-term/'
response = requests.get(url)
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    # Skip navigation, ads and other boilerplate; print only real content.
    if not paragraph.is_boilerplate:
        print(paragraph.text)

The error I get is:

C:\Users\micb1\PycharmProjects\pythonProject1\venv\Scripts\python.exe C:/Users/micb1/PycharmProjects/pythonProject1/content.py
Traceback (most recent call last):
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\urllib3\response.py", line 404, in _decode
    data = self._decoder.decompress(data)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\urllib3\response.py", line 91, in decompress
    ret += self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\models.py", line 760, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\urllib3\response.py", line 579, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\urllib3\response.py", line 551, in read
    data = self._decode(data, decode_content, flush_decoder)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\urllib3\response.py", line 407, in _decode
    raise DecodeError(
urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micb1\PycharmProjects\pythonProject1\content.py", line 6, in <module>
    response = requests.get(url)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\sessions.py", line 687, in send
    r.content
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\models.py", line 838, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "C:\Users\micb1\PycharmProjects\pythonProject1\venv\lib\site-packages\requests\models.py", line 765, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

Process finished with exit code 1

This is beyond my level of programming.

However, if I use the URL 'https://coachingkidz.com/what-is-a-cap-in-soccer-meaning-and-significance-explained/' instead, it works fine.

Any help on how to resolve this would be appreciated.

Thanks
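
One way to narrow this down, offered only as a sketch: the traceback says the server declared `content-encoding: gzip` but the body failed the gzip header check, which suggests the body may be mislabeled (for example sent as zlib, raw deflate, or not compressed at all). Assuming that is what is happening, the raw bytes can be fetched without automatic decoding and then decoded by hand. The names `decode_body` and `fetch_html` below are hypothetical helpers, not part of requests or justext, and this has not been verified against the site in question. It will not help if the body uses some other encoding such as Brotli.

```python
import zlib


def decode_body(raw: bytes) -> bytes:
    """Best-effort decode of a body whose Content-Encoding may be mislabeled."""
    # 16 + MAX_WBITS expects a gzip wrapper, MAX_WBITS a zlib wrapper,
    # and -MAX_WBITS a raw deflate stream with no header at all.
    for wbits in (16 + zlib.MAX_WBITS, zlib.MAX_WBITS, -zlib.MAX_WBITS):
        try:
            return zlib.decompress(raw, wbits)
        except zlib.error:
            continue
    # Nothing matched: assume the body was never compressed.
    return raw


def fetch_html(url: str) -> bytes:
    """Hypothetical helper: download a page and decode it manually."""
    import requests  # imported here so decode_body can be tried offline

    # stream=True stops requests from reading (and decoding) the body eagerly.
    response = requests.get(url, stream=True)
    # decode_content=False asks urllib3 for the bytes exactly as sent.
    return decode_body(response.raw.read(decode_content=False))
```

With this in place, `justext.justext(fetch_html(url), justext.get_stoplist("English"))` would replace the `response.content` call. If `decode_body` returns the bytes unchanged and they still look like binary data, the problem lies elsewhere.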

Upvotes: 0

Views: 212
