Derive correct URLs from HTML JS src tag

Question

I'm parsing HTML websites using Python3's html.parser to search for all included JavaScript files. For that I loop through all script tags and retrieve the content of the src attribute.

The challenge is to build the correct URLs. The src attribute may contain a full-qualified URL like https://example.com/jsfile.js but it may also be that it just contains a relative path. In these cases I have to set the scheme (http or https) and domain / network location manually.

As I could not come up with a reliably-working solution: does anybody have an idea how I can do this in Python3.5?

Thanks in advance, Andy

WAS · Accepted Answer

use

urllib.parse.urljoin to get full urls

If its full path it will return as it is, and if its relative it will return full path.

here is an example:

Derive correct URLs from HTML JS src tag

Answers (1)

Related Questions