Chan

Reputation: 4301

Unable to get webpage using aiohttp ClientSession

I would like to use asyncio to get the webpage.

However, when I executed the code below, no page is obtained.

The code is

import aiofiles
import aiohttp
from aiohttp import ClientSession
import asyncio

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding='GB18030')
        return 0, html
    except:
        return 1, []

async def main_get_webpage(urls):
    webpage = []
    connector = aiohttp.TCPConnector(limit=60)       
    async with ClientSession(connector=connector) as session:
        tasks = [get_webpage(url, session) for url in urls]
        result = await asyncio.gather(*tasks)
        for status, data in result:
            print(status)
            if status == 0:
                webpage.append(data)
        return webpage

if __name__ == '__main__':
    urls = ['https://lcdsj.fang.com/house/3120178164/fangjia.htm', 'https://mingliugaoerfuzhuangyuan0551.fang.com/house/2128242324/fangjia.htm']
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    webpage =  loop.run_until_complete(main_get_webpage(urls))

I expect two zeros to be printed in main_get_webpage(urls).

However, two ones are printed.

What's wrong with my code?

How to fix the problem?

Thank you very much.

Upvotes: 1

Views: 956

Answers (2)

Qiulang

Reputation: 12505

I think a better way may be to just use await res.text() instead of await res.text(encoding='GB18030'), because as https://docs.aiohttp.org/en/stable/client_reference.html?highlight=encoding#aiohttp.ClientResponse.text says:

If encoding is None content encoding is autocalculated using Content-Type HTTP header and chardet tool if the header is not provided by server.

I would argue that if aiohttp didn't use the charset in Content-Type to decode the response text, its implementation would be rather problematic. You really don't need to provide the encoding parameter.

I checked the two URLs in your example; the Content-Type is text/html; charset=utf-8 for both, so you can't use GB18030 to decode them.
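The autodetection the docs describe boils down to reading the charset parameter of the Content-Type header (falling back to a detection library when the header has none). A minimal stdlib sketch of that idea — the helper name here is my own, not part of aiohttp:

```python
from email.message import Message

def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset= parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # utf-8 (fallback)
```

This is why letting res.text() pick the encoding is safer than hard-coding one: the server, not the client, declares how the body is encoded.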

Upvotes: 1

user4815162342

Reputation: 155630

What's wrong with my code?

What's wrong is that you have a try: ... except: that masks the source of the problem. If you remove the except clause, you will find an error message that communicates the underlying issue:

UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb7 in position 47676: illegal multibyte sequence

The web page is not encoded as GB18030. The page declares itself as GB2312 (a precursor to GB18030), but using that as the encoding also fails.

How to fix the problem?

Depending on what you want to do with the web page text, you have several options:

  1. Find an encoding supported by Python that works with the page as given. This is the ideal option, but I wasn't able to find it with a short search. (Using this answer to find out what Chrome thinks the page uses didn't help either, because the response was GBK, which again produces an error on character 47676.)

  2. Decode the page with a more relaxed error handler, such as res.text(encoding='GB18030', errors='replace'). That will give you a good approximation of the text, with the undecipherable bytes rendered as the Unicode replacement character. This is a good option if you need to search the page for a substring or analyze it as text, and don't care about a weird character somewhere in it.

  3. Give up the idea of decoding the page as text, and just use await res.read() to get the bytes. This option is best if you need to archive or cache the page, or index it.
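The difference between strict and relaxed decoding can be seen on plain bytes, without any network traffic. The byte string below is a made-up example of data that is not valid GB18030:

```python
raw = b"abc\xff"  # 0xff can never start a valid GB18030 sequence

# Option 2: relaxed decoding -- undecodable bytes become U+FFFD
text = raw.decode("gb18030", errors="replace")
assert text == "abc\ufffd"

# Strict decoding (what res.text(encoding='GB18030') does) raises instead,
# which is exactly the error the bare except was hiding.
try:
    raw.decode("gb18030")
except UnicodeDecodeError as exc:
    print(exc)

# Option 3: skip decoding and keep the raw bytes
print(raw)
```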

Upvotes: 2
