f.zs

Reputation: 1

python-requests and urllib not giving the same HTML as seen in browser, target site only contains text (no apparent scripts)

I have the following url: https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc

It simply contains text, and I want to download it and store it on my disk as an xml file using Python. I'm using the requests module. Here is what I've tried doing:

import requests

url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"

r = requests.get(url, allow_redirects=True)
open('test.xml', 'wb').write(r.content)

When I go to inspect the contents of test.xml, it only contains the text "PLEASE DOWNLOAD RAW FILE". I've also tried using urllib.request.urlopen(), but I get the same result.

However, when I open the url in a browser, I see the full markup text, and I can even download the page and save it as an xml file.

The HTML that I receive, using the requests method, is:

<html>
   <body>
      <p>PLEASE DOWNLOAD RAW FILE</p>
   </body>
</html>

But the HTML on the site is like this

The text that I want to download is on the left. The HTML is displayed on the right. If I can just get the HTML that's on the right, then I know how to use something like BeautifulSoup to parse it and get what I want. But I'm not sure why python-requests and urllib are not giving me the right data.

Upvotes: 0

Views: 625

Answers (1)

Mureinik

Reputation: 311163

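That site seems to check the User-Agent header sent in the request. By default, requests identifies itself as python-requests/<version> and urllib as Python-urllib/<version>, and the server apparently answers those clients with the placeholder page instead of the real content.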

If you explicitly set a browser-like user-agent in your request, you'll get the response you're trying to get:

import requests

url = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"

# Create a dictionary of the headers including the User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}


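r = requests.get(url, headers=headers, allow_redirects=True)
r.raise_for_status()  # fail loudly on an HTTP error instead of saving an error page

# Write the raw response bytes; the with-block closes the file afterwards
with open('test.xml', 'wb') as f:
    f.write(r.content)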
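Since the question also mentions urllib.request.urlopen(), here is a rough sketch of the same idea using only the standard library. I haven't verified it against this site, so treat it as an assumption that the server applies the same User-Agent check to urllib. The helper names (default_urllib_user_agent, build_request, download) are made up for this example and are not part of any library:

```python
import urllib.request

URL = "https://tenhou.net/3/mjlog2xml.cgi?2009042400gm-00b9-0000-3a2a55dc"
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")


def default_urllib_user_agent():
    """Return the User-Agent urllib sends when none is set explicitly."""
    # A fresh opener carries the library default in its addheaders list.
    return dict(urllib.request.build_opener().addheaders)["User-agent"]


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, path, user_agent=BROWSER_UA):
    """Fetch url and write the raw response bytes to path."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)


if __name__ == "__main__":
    # Shows what the server sees when no header is set (Python-urllib/<version>).
    print("urllib default:", default_urllib_user_agent())
    download(URL, "test.xml")
```

Printing the default first makes it easy to see why the server can tell a script apart from a browser; the only change that matters is passing the headers dictionary to urllib.request.Request.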

Upvotes: 1
