BeautifulSoup for Mandarin

Question

I'm trying to scrape a site in Mandarin using BeautifulSoup. Unfortunately, when I do, BeautifulSoup finds the html, head, and body tags, but everything between the opening and closing body tags is gibberish. I've tried multiple parsers, and as far as I can tell only html5lib finds the whole page, because it returns by far the longest result. So I think I'm using the right parser and the encoding is what's wrong. The website lists 'gb2312' as its encoding, but decoding with that still produces gibberish. I also tried chardet to detect the encoding; it returned 'windows-1252', which didn't seem correct either. I have gone through many of the standard Chinese character encodings (found here), but none of them returns anything coherent, although some produce one or two Chinese characters. I also created an output file for every possible Python encoding, and none of them looks correct.

Other than going through the different encodings, I'm not sure what else to try. Any help would be greatly appreciated, thanks!
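For reference, the encoding trial described above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the question: the helper names `decode_with_candidates` and `cjk_count` and the candidate list are assumptions made up for this example, and it assumes the input is the raw, undecoded bytes of the page.

```python
from typing import Dict, Iterable, Optional

# Hypothetical candidate list; these are standard Python codec names.
CANDIDATES = ("utf-8", "gb2312", "gbk", "gb18030", "big5")


def decode_with_candidates(
    raw: bytes, encodings: Iterable[str] = CANDIDATES
) -> Dict[str, Optional[str]]:
    """Try each encoding strictly; map its name to the decoded text, or None on failure."""
    results: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def cjk_count(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


if __name__ == "__main__":
    # Stand-in for the page's raw bytes (for example, the body of an HTTP response).
    raw = "你好，世界".encode("gb2312")
    for name, text in decode_with_candidates(raw).items():
        if text is None:
            print(name, "failed to decode")
        else:
            print(name, cjk_count(text), "CJK characters")
```

A decoded string that looks plausible can then be passed to `BeautifulSoup(text, "html5lib")`. Decoding strictly, instead of ignoring errors, makes it easier to see which encodings genuinely fit the bytes and which only appear to.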
