Python web scraping: problems parsing Chinese characters with Beautiful Soup/requests

Question

I am scraping a Chinese website. Usually I have no problem parsing the Chinese characters, which I use to find specific URLs by pattern with bs4. For this particular website, however, the soup is not parsed properly. Below is the code I use to set up the soup:

import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")

An example of the printed soup is the following:

[Image: current soup]

Note: I had to add a picture because Stack thought the pasted text was spam :)

The above should have looked like the following:

[Image: proper soup]

I wonder whether I have to specify some kind of encoding in the request, or perhaps in the soup, but so far I have not found anything that works.

Thanks in advance!

chitown88 · Accepted Answer

I don't know Chinese. Does this give the desired results?

import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")

print(soup)
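To illustrate why the explicit decode in the code above matters, here is a minimal offline sketch. It assumes the page really is GBK-encoded, as the answer guesses. The sample HTML string and the `decode_page` helper are hypothetical stand-ins, not anything taken from the site.

```python
# Hypothetical sample standing in for the bytes a GBK-encoded page might return.
sample_html = '<a href="/news/1.html">水产新闻</a>'
raw = sample_html.encode("gbk")


def decode_page(raw_bytes, encoding="gbk"):
    """Decode raw response bytes, silently dropping bytes invalid in `encoding`."""
    return raw_bytes.decode(encoding, "ignore")


# Decoding with the wrong codec garbles the Chinese text (ASCII parts survive).
garbled = raw.decode("utf-8", "replace")

# Decoding with the codec that matches the page recovers the original text.
readable = decode_page(raw)

print(garbled)
print(readable)
```

With requests, setting `r.encoding = 'gbk'` before reading `r.text` should give a similar result, since `r.text` decodes `r.content` using `r.encoding`.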
