kyrenia

Reputation: 5565

Capture DNS requests on opening html using Python

I am looking to identify the DNS requests that would be made on opening an HTML file (using Python). Specifically, I want to see which domains resources would be loaded from if that page were opened in a web browser. I do not actually want to make the DNS requests or load the external resources, just identify what they would be (or, more specifically, where they would be coming from).

(I have a bunch [millions] of HTML files, and I want to identify which domains each would try to load external resources from.)

I assume there must be a Python package that can assist with this, but I can't seem to find it. I am looking for a pointer in the right direction rather than fully developed code.

Upvotes: 0

Views: 304

Answers (1)

flaschbier

Reputation: 4177

Sorry to say, but this is one of the rare cases where Python is the last thing you need to achieve your goal. With Python you can neither interpret HTML in a way that issues the dependent web requests you are after, nor is Python the best tool for hooking into DNS lookups on your machine.
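To illustrate that limitation: a static pass with Python's standard library can list the hosts written directly into the markup, but it sees nothing that JavaScript adds at runtime or that a stylesheet pulls in. The following is only a rough sketch under assumptions; the names `HostCollector`, `static_hosts` and the attribute set `URL_ATTRS` are made up for this example, and the attribute list is not exhaustive (it also treats every `<link href>` as a fetched resource, which is not always true).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Attributes that commonly carry a resource URL (illustrative, not exhaustive).
URL_ATTRS = {"src", "href", "poster", "data"}


class HostCollector(HTMLParser):
    """Collect hostnames of absolute or protocol-relative URLs in the markup."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            return  # a plain hyperlink is not fetched when the page loads
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Relative URLs have no hostname and are skipped.
                host = urlparse(value).hostname
                if host:
                    self.hosts.add(host)


def static_hosts(html_text):
    """Return the set of external hosts referenced statically in html_text."""
    parser = HostCollector()
    parser.feed(html_text)
    return parser.hosts
```

Anything this misses (script-injected tags, `@import` or `url(...)` inside CSS, redirects) is exactly what only a real browser run would reveal.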

I would instead suggest using a scriptable headless browser (like PhantomJS) to request all the HTML pages in your archive (ideally via a local web server). The headless browser will then not only read the HTML source (as a Python requests.get or similar would do) but also interpret embedded JavaScript and load linked resources such as CSS stylesheets, images, etc. Only this will produce the DNS lookups you want to learn about.

You should then install a local "spy" DNS server that you control, so you can find out which DNS entries are looked up. A great tutorial on how to set up such a server under Linux can be found here. And yes, there is also room for Python, because you will want to analyze and condense the log file of your "spy" DNS server.
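The log-condensing step could look roughly like the sketch below. It assumes the "spy" server is dnsmasq with query logging enabled, whose lines contain `query[TYPE] name from client`; a different DNS server will need a different pattern. The function name `count_queries` is hypothetical.

```python
import re
from collections import Counter

# Assumed dnsmasq-style line:
#   "Jan  1 00:00:01 dnsmasq[123]: query[A] example.com from 127.0.0.1"
QUERY_RE = re.compile(
    r"query\[(?P<type>[A-Z0-9]+)\]\s+(?P<name>\S+)\s+from\s+(?P<client>\S+)"
)


def count_queries(lines):
    """Count how often each domain name was looked up in the given log lines."""
    counts = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            # Normalize case and drop a trailing root dot so names compare equal.
            counts[match.group("name").lower().rstrip(".")] += 1
    return counts
```

Since a file object iterates line by line, the function can be fed an open log file directly, so even a large log is processed without loading it into memory at once.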

Upvotes: 1
