Reputation: 1
I have the following url: https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc
The page simply contains text, and I want to download it and save it to disk as an XML file using Python. I'm using the requests module. Here is what I've tried:
import requests
url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"
r = requests.get(url, allow_redirects=True)
open('test.xml', 'wb').write(r.content)
When I inspect the contents of test.xml, it only contains the text "PLEASE DOWNLOAD RAW FILE". I've also tried using urllib.request.urlopen(), but I get the same result.
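For reference, the urllib attempt looked roughly like this (a sketch of the plain urlopen pattern, with no custom headers):
import urllib.request

url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"
# Fetch the URL and write the raw response bytes to disk
with urllib.request.urlopen(url) as resp:
    data = resp.read()
with open('test.xml', 'wb') as f:
    f.write(data)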
However, when I open the URL in a browser, I see the full markup, and I can even save the page as an XML file.
The HTML that I receive, using the requests method, is:
<html>
<body>
<p>PLEASE DOWNLOAD RAW FILE</p>
</body>
</html>
But the HTML on the site is different. [Screenshot: the raw text I want to download is shown on the left; the page's HTML is shown on the right.]
If I can just get the HTML that's on the right, then I know how to use something like BeautifulSoup to parse it and get what I want. But I'm not sure why python-requests and urllib are not giving me the right data.
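For reference, the parsing step I have in mind looks roughly like this (a sketch only; it just dumps every tag, since the exact tag names depend on the mjlog format):
from bs4 import BeautifulSoup

# Parse the downloaded file; the 'xml' parser requires lxml to be installed
with open('test.xml', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'xml')

# Print every tag name and its attributes to explore the structure
for tag in soup.find_all(True):
    print(tag.name, tag.attrs)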
Upvotes: 0
Views: 625
Reputation: 311163
That site seems to check the User-Agent header sent in the request.
If you explicitly set a browser-like User-Agent in your request, you'll get the response you're trying to get:
import requests
url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"
# Create a dictionary of the headers including the User-Agent
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
r = requests.get(url, headers=headers, allow_redirects=True)

# Save the response body to disk; the with-block closes the file cleanly
with open('test.xml', 'wb') as f:
    f.write(r.content)
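Since you also tried urllib, the same header trick works there too by attaching the headers to a Request object. A minimal sketch, reusing the User-Agent string above:
import urllib.request

url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
# Attach the browser-like User-Agent via a Request object
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    with open('test.xml', 'wb') as f:
        f.write(resp.read())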
Upvotes: 1