Reputation: 5565
I want to identify the DNS requests that would be made on opening an HTML file (using Python). Specifically, I want to see which domains resources would be loaded from if that page were opened in a web browser. I do not actually want to make the DNS requests or load the external resources, just identify what they would be (or, more specifically, where they would come from).
(I have a bunch [millions] of HTML files, and I want to identify which domains each would try to load external resources from.)
I assume there must be a Python package that can assist with this, but I can't seem to find it. I am looking for a pointer in the right direction rather than fully developed code.
Upvotes: 0
Views: 304
Reputation: 4177
Sorry to say, but for once Python will be the last thing you need to achieve your goal. With Python alone you cannot interpret HTML in a way that issues the dependent web requests you are after, and Python is not the best tool for hooking into DNS lookups on your machine either.
I would rather suggest using a scriptable headless browser (like PhantomJS) to request all the HTML pages in your archive (ideally via a local web server). The headless browser will then not only read the HTML source (as Python's requests.get would do) but also interpret embedded JavaScript and load remote links (like CSS stylesheets), images, etc. Only this will produce the DNS lookups you want to learn about.
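To illustrate the difference, here is a minimal sketch of what a purely static parse gives you, using only the standard library's html.parser. The names ResourceHostParser, static_hosts and URL_ATTRS are made up for this example, and the attribute list is an assumption that is far from exhaustive (it ignores srcset, CSS url(...) and so on). It only sees URLs written literally in the markup, so anything a script would inject at runtime is missed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Attributes that commonly carry a resource URL (assumed, not exhaustive).
URL_ATTRS = {"src", "href", "poster", "data"}


class ResourceHostParser(HTMLParser):
    """Collects host names from URL-bearing attributes of start tags."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Plain hyperlinks are not fetched when the page loads.
            return
        for name, value in attrs:
            if name in URL_ATTRS and value:
                host = urlparse(value).hostname
                if host:  # relative URLs have no host and are skipped
                    self.hosts.add(host)


def static_hosts(html_text):
    """Return the set of hosts referenced literally in the markup."""
    parser = ResourceHostParser()
    parser.feed(html_text)
    return parser.hosts


if __name__ == "__main__":
    sample = (
        '<img src="https://cdn.example.com/a.png">'
        '<script src="//static.example.org/app.js"></script>'
        "<script>document.write('<img src=\"http://js.example.net/x.png\">')</script>"
    )
    print(sorted(static_hosts(sample)))
```

In the sample, the host that only appears inside the inline script is not reported, which is exactly the gap a headless browser closes.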
You should then install a local "spy" DNS server that you control to find out which DNS entries are looked up. A great tutorial on how to set up such a server under Linux can be found here. And yes, there is also room for Python, because you will want to analyze and condense the log file of your "spy" DNS server.
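As a rough sketch of that last step, the snippet below condenses a query log into per-domain counts. It assumes the "spy" server is something like dnsmasq with query logging enabled, writing lines of the form "... query[A] example.com from 127.0.0.1". That log format, the regular expression and the function name count_queries are assumptions for illustration, so adjust the pattern to whatever your server actually writes.

```python
import re
from collections import Counter

# Assumed dnsmasq-style log line:
#   "Jan  1 12:00:00 dnsmasq[123]: query[A] example.com from 127.0.0.1"
QUERY_RE = re.compile(
    r"query\[(?P<type>[A-Z0-9]+)\]\s+(?P<name>\S+)\s+from\s+(?P<client>\S+)"
)


def count_queries(lines):
    """Count how often each domain name appears in query log lines."""
    counts = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            # Normalize case and drop a trailing dot, if any.
            counts[match.group("name").rstrip(".").lower()] += 1
    return counts


if __name__ == "__main__":
    sample_log = [
        "Jan  1 12:00:00 dnsmasq[123]: query[A] cdn.example.com from 127.0.0.1",
        "Jan  1 12:00:00 dnsmasq[123]: forwarded cdn.example.com to 8.8.8.8",
        "Jan  1 12:00:01 dnsmasq[123]: query[AAAA] CDN.example.com from 127.0.0.1",
        "Jan  1 12:00:02 dnsmasq[123]: query[A] fonts.example.org from 127.0.0.1",
    ]
    for name, n in count_queries(sample_log).most_common():
        print(name, n)
```

For a real log you would pass an open file object instead of the sample list, since the function only needs an iterable of lines.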
Upvotes: 1