Reputation: 8557
I'm trying to extract content from a webpage using Requests and Beautiful Soup.
When retrieving the page content using Requests, I ran into a rather strange issue. As you can see in the screenshot (original page), Â characters seem to be inserted at random (I've highlighted them to make them clearer).
Sample code:
from bs4 import BeautifulSoup
import requests
url = 'https://technet.microsoft.com/en-us/sysinternals/bb963902'
r = requests.get(url=url)
with open('/Users/xxxx/test.html', 'wb') as f:
    f.write(r.content)
At first, I thought it had something to do with the encoding not being UTF-8, but this seems to be ok:
r.encoding
>> 'utf-8'
I've tried retrieving the same page with curl (curl 7.37.1 (x86_64-apple-darwin14.0) libcurl/7.37.1 SecureTransport zlib/1.2.5) and the same issue appears in the output.
Upvotes: 1
Views: 857
Reputation: 56809
The file you receive is correct. The saved HTML contains no charset declaration, so when you open the downloaded copy locally, the browser has to guess the encoding and picks the wrong one (Western instead of Unicode).
The page renders correctly when you browse it online because the server sends the charset in the Content-Type header, which the saved file no longer carries.
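To make the effect concrete, here is a minimal sketch that reproduces the stray Â without any network access and shows one possible way to avoid it. It assumes the affected characters are non-breaking spaces (U+00A0), which is a common cause but not something confirmed for this page. The sample string and the helper with_utf8_bom are hypothetical names used only for illustration.

```python
import codecs

# Stand-in for the page content: U+00A0 (non-breaking space) is an assumed
# example of a non-ASCII character, not something confirmed from the page.
original = "Process\u00a0Explorer"

# The bytes Requests hands back in r.content are valid UTF-8.
utf8_bytes = original.encode("utf-8")

# A browser that guesses a Western encoding reads each byte as one character,
# so the two-byte sequence C2 A0 becomes "Â" followed by a non-breaking space.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))


def with_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: prefix a UTF-8 byte order mark unless already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


# Writing with_utf8_bom(r.content) instead of r.content gives the saved file
# an encoding signal that does not depend on the HTTP headers.
marked = with_utf8_bom(utf8_bytes)
```

In the original script this would mean calling f.write(with_utf8_bom(r.content)). Adding a meta charset declaration to the saved HTML, or manually selecting Unicode (UTF-8) in the browser's encoding menu, should work as well.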
Upvotes: 1