Reputation: 5510
For an internal consistency-checking tool, I am trying to assemble a list of all hyperlinks (external references, images, etc.) in a reStructuredText (rst) document using Python 3.
I managed to parse the rst and walk the tree using the code below:
    import docutils.frontend
    import docutils.nodes
    import docutils.parsers.rst
    import docutils.utils

    # Build a parser and an empty document with default settings.
    parser = docutils.parsers.rst.Parser()
    components = (docutils.parsers.rst.Parser,)
    settings = docutils.frontend.OptionParser(components=components).get_default_values()
    document = docutils.utils.new_document('<rst-doc>', settings=settings)
    parser.parse(f, document)  # f is expected to hold the rst source text as a string

    class MyVisitor(docutils.nodes.NodeVisitor):
        def visit_reference(self, node: docutils.nodes.reference) -> None:
            """Called for "reference" nodes."""
            print("reference", node)

        def unknown_visit(self, node: docutils.nodes.Node) -> None:
            """Called for all other node types."""
            print("unknown_visit", node)

    # Walk the parsed tree, dispatching each node to the visitor.
    visitor = MyVisitor(document)
    document.walk(visitor)
However, I am now completely stuck on how to find references to images and external links (URLs) within the result.
Does anyone know how to retrieve these external links programmatically from the parsed document?
Upvotes: 1
Views: 154
Reputation: 21
Yes, use a regular expression library:
https://www.w3schools.com/python/python_regex.asp
You should be able to match something like (http[^\s]*),
which means: match the text "http" followed by zero or more non-whitespace characters ([^\s] is the negated whitespace class, equivalent to \S).
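As a rough sketch of how that pattern behaves on rst text, something like the following could be tried. The names find_urls, find_urls_strict, STRICT_PATTERN and the sample string are made up for illustration, and the stricter pattern is only an assumption about what a cleaner match might look like, not a complete URL grammar.

```python
import re
from typing import List

# The pattern suggested above: "http" followed by any run of
# non-whitespace characters.
URL_PATTERN = re.compile(r"(http[^\s]*)")

# Hypothetical stricter variant (an assumption, not a full URL grammar):
# require "http://" or "https://" and stop at rst markup characters.
STRICT_PATTERN = re.compile(r"https?://[^\s<>`]+")


def find_urls(rst_text: str) -> List[str]:
    """Return every substring matched by the suggested pattern."""
    return URL_PATTERN.findall(rst_text)


def find_urls_strict(rst_text: str) -> List[str]:
    """Return matches of the stricter pattern, without trailing markup."""
    return STRICT_PATTERN.findall(rst_text)


# Made-up rst snippet with one inline hyperlink and one image directive.
sample = (
    "See `the docs <https://docutils.sourceforge.io/>`_ for details.\n"
    "\n"
    ".. image:: http://example.com/logo.png\n"
)

print(find_urls(sample))
print(find_urls_strict(sample))
```

Note that the plain pattern also captures the trailing rst markup of an inline link (the ">`_" after the URL), and that neither pattern finds relative image paths such as images/logo.png, since those do not start with "http".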
Upvotes: 1