Reputation: 516
I am trying to perform a GET request on TCG Player via Requests in Python. I checked the site's robots.txt, which specifies:
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This is my first time seeing a robots.txt file.
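For readers also new to robots.txt, the directives above can be read with Python's standard urllib.robotparser module. This is only a minimal sketch, assuming Python 3.8+ (for site_maps()) and that the file contains exactly the four lines quoted above. The text is parsed from a string here rather than downloaded. Crawl-Delay is a non-standard directive that is usually read as "wait this many seconds between requests".

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the question, parsed offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "Allow: /" under "User-agent: *" permits any crawler to fetch any path.
allowed = rp.can_fetch("*", "https://www.tcgplayer.com/")

# Requested pause between requests, usually interpreted as seconds.
delay = rp.crawl_delay("*")

# Sitemap URLs listed in the file (Python 3.8+).
sitemaps = rp.site_maps()

print(allowed, delay, sitemaps)
```

In other words, the file does not forbid scraping, but it asks crawlers to pause between requests, so a polite script would sleep for that long between calls.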
My code is as follows:
import requests
url = "http://www.tcgplayer.com"
r = requests.get(url)
print(r.text)
I cannot include r.text in my post because the character limit would be exceeded.
I expected to receive the HTML content of the webpage, but I got an 'unfriendly' response instead. What is the meaning of the text above? Is there a way to get the HTML so I can scrape the site?
By 'unfriendly' I mean:
The HTML that is returned does not match the HTML that is produced by typing the URL into my web browser.
Upvotes: 0
Views: 55
Reputation: 940
This is probably because the page content is rendered client-side by JavaScript, as indicated by the empty <div id="app"></div> block in the scraped result. Requests only downloads the initial HTML and does not execute scripts, so the markup never gets filled in. To properly handle such content, you will need a tool that drives a real browser, like Selenium. I'd recommend this tutorial to get started.
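As a rough illustration of the Selenium approach, here is a minimal sketch rather than a definitive implementation. It assumes Selenium 4 and a recent Chrome are installed, and that the page fills the element with id "app" once its scripts have run. The helper names looks_client_rendered and fetch_rendered_html are hypothetical, chosen only for this example.

```python
import re


def looks_client_rendered(html: str) -> bool:
    """Heuristic: True if the HTML contains an empty <div id="app"> mount point."""
    pattern = r'<div[^>]*\bid=["\']app["\'][^>]*>\s*</div>'
    return re.search(pattern, html) is not None


def fetch_rendered_html(url: str, timeout: float = 15.0) -> str:
    """Load the page in headless Chrome and return the HTML after scripts run."""
    # Imported lazily so the heuristic above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has put some content inside the mount point.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.ID, "app").get_attribute("innerHTML").strip()
        )
        return driver.page_source
    finally:
        driver.quit()


# Example usage (not run here, since it needs a browser and network access):
#   html = requests.get("https://www.tcgplayer.com").text
#   if looks_client_rendered(html):
#       html = fetch_rendered_html("https://www.tcgplayer.com")
```

If you request more than one page, it would be polite to sleep between requests for the Crawl-Delay given in the site's robots.txt.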
Upvotes: 1