Chan

Reputation: 4301

Unable to get webpage using aiohttp ClientSession

I would like to use asyncio to get the webpage.

However, when I executed the code below, no page is obtained.

The code is

import aiofiles
import aiohttp
from aiohttp import ClientSession
import asyncio

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding='GB18030')
        return 0, html
    except:
        return 1, []

async def main_get_webpage(urls):
    webpage = []
    connector = aiohttp.TCPConnector(limit=60)       
    async with ClientSession(connector=connector) as session:
        tasks = [get_webpage(url, session) for url in urls]
        result = await asyncio.gather(*tasks)
        for status, data in result:
            print(status)
            if status == 0:
                webpage.append(data)
        return webpage

if __name__ == '__main__':
    urls = ['https://lcdsj.fang.com/house/3120178164/fangjia.htm', 'https://mingliugaoerfuzhuangyuan0551.fang.com/house/2128242324/fangjia.htm']
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    webpage =  loop.run_until_complete(main_get_webpage(urls))

I expect two zeros to be printed in main_get_webpage(urls).

However, two ones are printed.

What's wrong with my code?

How to fix the problem?

Thank you very much.

Upvotes: 1

Views: 956

Answers (2)

Qiulang

Reputation: 12505

I think a better way may be to just use await res.text() instead of await res.text(encoding='GB18030'), because as https://docs.aiohttp.org/en/stable/client_reference.html?highlight=encoding#aiohttp.ClientResponse.text says:

If encoding is None content encoding is autocalculated using Content-Type HTTP header and chardet tool if the header is not provided by server.

I would argue that if aiohttp didn't use the charset in Content-Type to decode the response text, its implementation would be rather problematic. You really don't need to provide the encoding parameter.

I checked the two URLs in your example; the Content-Type is text/html; charset=utf-8 for both, so you can't use GB18030 to decode them.
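The autodetection the docs describe boils down to reading the charset parameter of the Content-Type header (falling back to a detection library when the header has none). A minimal stdlib sketch of that idea — the helper name here is my own, not part of aiohttp:

```python
from email.message import Message

def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset= parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # utf-8 (fallback)
```

This is why letting res.text() pick the encoding is safer than hard-coding one: the server, not the client, declares how the body is encoded.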

Upvotes: 1

user4815162342

Reputation: 155630

What's wrong with my code?

What's wrong is that you have a try: ... except: that masks the source of the problem. If you remove the except clause, you will find an error message that communicates the underlying issue:

UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb7 in position 47676: illegal multibyte sequence

The web page is not encoded as GB18030. The page declares itself as GB2312 (a precursor to GB18030), but using that as the encoding also fails.

How to fix the problem?

Depending on what you want to do with the web page text, you have several options:

  1. Find an encoding supported by Python that works with the page as given. This is the ideal option, but I wasn't able to find it with a short search. (Using this answer to find out what Chrome thinks the page uses didn't help either, because the response was GBK, which again produces an error on character 47676.)

  2. Decode the page with a more relaxed error handler, such as res.text(encoding='GB18030', errors='replace'). That will give you a good approximation of the text, with the undecipherable bytes rendered as the Unicode replacement character. This is a good option if you need to search the page for a substring or analyze it as text, and don't care about a weird character somewhere in it.

  3. Give up the idea of decoding the page as text, and just use await res.read() to get the bytes. This option is best if you need to archive or cache the page, or index it.
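The difference between strict and relaxed decoding can be seen on plain bytes, without any network traffic. The byte string below is a made-up example of data that is not valid GB18030:

```python
raw = b"abc\xff"  # 0xff can never start a valid GB18030 sequence

# Option 2: relaxed decoding -- undecodable bytes become U+FFFD
text = raw.decode("gb18030", errors="replace")
assert text == "abc\ufffd"

# Strict decoding (what res.text(encoding='GB18030') does) raises instead,
# which is exactly the error the bare except was hiding.
try:
    raw.decode("gb18030")
except UnicodeDecodeError as exc:
    print(exc)

# Option 3: skip decoding and keep the raw bytes
print(raw)
```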

Upvotes: 2
