Reputation: 686
I'm trying to use the docutils package to convert ReST to HTML. This answer succinctly uses the docutils publish_* convenience functions to achieve this in one step. The ReST documents that I want to convert have multiple sections that I want to separate in the resulting HTML. As such, I want to break this process down into three steps:
1. Parse the ReST into a tree of nodes.
2. Separate the nodes as appropriate.
3. Convert the nodes I want into HTML.
It's step three that I'm struggling with. Here's how I do steps one and two:
from docutils import utils
from docutils.frontend import OptionParser
from docutils.parsers.rst import Parser
# preamble
rst = '*NB:* just an example.' # will actually have many sections
path = 'some.url.com'
settings = OptionParser(components=(Parser,)).get_default_values()
# step 1
document = utils.new_document(path, settings)
Parser().parse(rst, document)
# step 2
for node in document:
    do_something_with(node)
# step 3: Help!
for node in filtered(document):
    print(convert_to_html(node))
I've found the HTMLTranslator class and the Publisher class. They seem relevant but I'm struggling to find good documentation. How should I implement the convert_to_html function?
Upvotes: 7
Views: 1818
Reputation: 1505
This is not exposed by default, but can be achieved by using the Publisher class (https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/core.py#l36). Combined with the implementation of publish_parts, the way to go is clear: create your own Publisher instance, set the settings you want, do your parsing and set the internal document field to your thing, apply the transformations you want, and then use writer.assemble_parts() to get the parts that you can then later extract.
The only downside is that it's not clear to me whether all of this still counts as public API usage, so it might break every once in a while. I don't have any long-term experience with docutils, but it seems like a pretty stable project to me, so personally I don't worry about that.
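The Publisher-based approach described above might look roughly like the following. This is a minimal sketch, not a definitive implementation: it assumes the wiring that docutils.core.publish_from_doctree uses internally (a doctree reader fed from docutils.io.DocTreeInput), it wraps each node in a fresh document named "<fragment>" (a made-up placeholder), and it assumes the html5 writer is available. The name convert_to_html is taken from the question; the sample ReST is invented for illustration.

```python
from docutils import io, utils
from docutils.core import Publisher, publish_doctree
from docutils.parsers import null
from docutils.readers import doctree
from docutils.writers import html5_polyglot


def convert_to_html(node):
    """Render a single doctree node as an HTML fragment (sketch)."""
    # Put a copy of the node into a fresh, otherwise empty document so
    # that the original tree is left untouched.
    fragment = utils.new_document("<fragment>")
    fragment.append(node.deepcopy())

    # A doctree reader with a no-op parser: the "source" is already a
    # parsed document, so nothing is parsed again.
    pub = Publisher(
        reader=doctree.Reader(parser=null.Parser()),
        writer=html5_polyglot.Writer(),
        source=io.DocTreeInput(fragment),
        destination_class=io.StringOutput,
    )
    # Settings have to be in place before publish(); otherwise the
    # Publisher tries to read them from the command line.
    pub.process_programmatic_settings(
        None, {"output_encoding": "unicode"}, None
    )
    pub.set_destination(None, None)
    pub.publish()
    # publish() calls writer.assemble_parts(), which fills writer.parts.
    return pub.writer.parts["body"]


# Usage: two top-level sections, each rendered on its own.
tree = publish_doctree("First\n=====\n\nAlpha.\n\nSecond\n======\n\nBeta.\n")
for section in tree.children:
    print(convert_to_html(section))
```

Because the node is copied into a new document, cross-references, footnotes and anything else that depends on the rest of the tree may not resolve; for self-contained sections this is usually not a problem.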
You need publish_parts from docutils.core!
import docutils.io  # imported explicitly because docutils.io.FileInput is used below
from docutils.core import publish_parts

if __name__ == "__main__":
    # Convert a string:
    parts = publish_parts(
        source="Hello\n========\n\nThis is my document.",
        writer_name="html5",
    )
    # Prints only "<p>This is my document.</p>"
    print(parts["body"])
    # To convert a file:
    parts = publish_parts(
        source=None,
        source_path="path/to/doc.rst",
        source_class=docutils.io.FileInput,
        writer_name="html5",
    )
    print(parts["body"])
Don't forget that you still need to do some string substitution if you want to use the parts. For example, as mentioned in the docs, the encoding still needs to be set in the output even if you use the "whole" part.
To see what parts are available have a look at the docs: https://docutils.sourceforge.io/docs/api/publisher.html#publish-parts
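The available parts can also be inspected directly, since publish_parts returns an ordinary dictionary. The snippet below is a small sketch along those lines; the sample document is the one from the example above, and the exact set of keys depends on the writer and the docutils version.

```python
from docutils.core import publish_parts

parts = publish_parts(
    source="Hello\n=====\n\nThis is my document.",
    writer_name="html5",
)

# The dictionary keys are the names of the available parts.
for name in sorted(parts):
    print(name)

# A few parts that are commonly useful:
title = parts["title"]          # document title as plain text
body = parts["body"]            # body without the document title
html_body = parts["html_body"]  # body including the title markup
```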
Upvotes: 0
Reputation: 686
My problem was that I was trying to use the docutils package at too low a level. They provide an interface for this sort of thing:
from docutils.core import publish_doctree, publish_from_doctree
rst = '*NB:* just an example.'
# step 1
tree = publish_doctree(rst)
# step 2
# do something with the tree
# step 3
html = publish_from_doctree(tree, writer_name='html').decode()
print(html)
Step one is now much simpler. That said, I'm still slightly dissatisfied with the result; I realise that what I really want is a publish_node function. If you know a better way please do post it.
I should also note that I haven't managed to get this working with Python 3.
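For what it's worth, here is a variant of the snippet above that asks docutils for a text string directly instead of decoding bytes afterwards. It is only a sketch: it assumes a Python 3 installation of docutils with the html5 writer, and whether it addresses the Python 3 problem mentioned above is not something this post confirms.

```python
from docutils.core import publish_doctree, publish_from_doctree

rst = "*NB:* just an example."

# step 1: parse the ReST into a doctree
tree = publish_doctree(rst)

# step 2: the tree could be inspected or modified here

# step 3: with output_encoding set to "unicode", the result is a str,
# so no .decode() call is needed
html = publish_from_doctree(
    tree,
    writer_name="html5",
    settings_overrides={"output_encoding": "unicode"},
)
print(html)
```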
What I was actually trying to do was extract all of the sidebar elements from the doctree so they can be handled separately to the main body of the article. This is not the sort of use case that docutils was intended to solve. Hence no publish_node function.
Once I realised this, the correct approach was simple enough:
1. Generate the HTML using docutils.
2. Extract the sidebar elements using BeautifulSoup.
Here's the code that got the job done:
from docutils.core import publish_parts
from bs4 import BeautifulSoup
rst = get_rst_string_from_somewhere()
# get just the body of an HTML document
html = publish_parts(rst, writer_name='html')['html_body']
soup = BeautifulSoup(html, 'html.parser')
# docutils wraps the body in a div with the .document class
# we can just dispose of that div altogether
wrapper = soup.select('.document')[0]
wrapper.unwrap()
# knowing that docutils gives all sidebar elements the
# .sidebar class makes extracting those elements easy
sidebar = ''.join(tag.extract().prettify() for tag in soup.select('.sidebar'))
# leaving the non-sidebar elements as the document body
body = soup.prettify()
Upvotes: 8