Reputation: 55
I am scraping a Chinese website and usually there is no problem to parse the chinese characters which i use to find specific urls with the pattern function within bs4. However, for this particular chinese website the soup cannot be parsed properly. Below is the code i use to set up the soup:
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")
An example of the printed soup is the following:
Note: I had to add a picture as Stack though it was spam :)
The above should have looked like the following:
I wonder if i have to specify some kind of encoding within the request or perhaps something within the soup but as for now i have not found anything that would work.
Thanks in advance!
Upvotes: 0
Views: 1083
Reputation: 28595
I don't know Chinese. Does this give the desired results?
import requests
from bs4 import BeautifulSoup as bs
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")
print(soup)
Upvotes: 1