Reputation: 81
I'm parsing HTML websites using Python3's html.parser to search for all included JavaScript files. For that I loop through all script tags and retrieve the content of the src attribute.
The challenge is to build the correct URLs. The src attribute may contain a full-qualified URL like https://example.com/jsfile.js but it may also be that it just contains a relative path. In these cases I have to set the scheme (http or https) and domain / network location manually.
As I could not come up with a reliably-working solution: does anybody have an idea how I can do this in Python3.5?
Thanks in advance, Andy
Upvotes: 1
Views: 35
Reputation: 175
use
urllib.parse.urljoin to get full urls
If its full path it will return as it is, and if its relative it will return full path.
Upvotes: 2