Andy
Andy

Reputation: 81

Derive correct URLs from HTML JS src tag

I'm parsing HTML websites using Python3's html.parser to search for all included JavaScript files. For that I loop through all script tags and retrieve the content of the src attribute.

The challenge is to build the correct URLs. The src attribute may contain a full-qualified URL like https://example.com/jsfile.js but it may also be that it just contains a relative path. In these cases I have to set the scheme (http or https) and domain / network location manually.

As I could not come up with a reliably-working solution: does anybody have an idea how I can do this in Python3.5?

Thanks in advance, Andy

Upvotes: 1

Views: 35

Answers (1)

WAS
WAS

Reputation: 175

use

urllib.parse.urljoin to get full urls

If its full path it will return as it is, and if its relative it will return full path.

here is an example:enter image description here

Upvotes: 2

Related Questions