QuiteClose

Reputation: 686

How do I convert a docutils document tree into an HTML string?

I'm trying to use the docutils package to convert reST to HTML. Another answer does this succinctly, in one step, with the docutils publish_* convenience functions. However, the reST documents that I want to convert have multiple sections that I want to separate in the resulting HTML, so I want to break the process down:

  1. Parse the ReST into a tree of nodes.
  2. Separate the nodes as appropriate.
  3. Convert the nodes I want into HTML.

It's step three that I'm struggling with. Here's how I do steps one and two:

from docutils import utils
from docutils.frontend import OptionParser
from docutils.parsers.rst import Parser

# preamble
rst = '*NB:* just an example.'   # will actually have many sections
path = 'some.url.com'
settings = OptionParser(components=(Parser,)).get_default_values()

# step 1
document = utils.new_document(path, settings)
Parser().parse(rst, document)

# step 2
for node in document:
   do_something_with(node)

# step 3: Help!
for node in filtered(document):
   print(convert_to_html(node))

I've found the HTMLTranslator class and the Publisher class. They seem relevant but I'm struggling to find good documentation. How should I implement the convert_to_html function?

Upvotes: 7

Views: 1818

Answers (2)

bobismijnnaam

Reputation: 1505

This is not exposed directly, but it can be achieved with the Publisher class (https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/core.py#l36). Following the implementation of publish_parts, the approach is: create your own Publisher instance, configure its settings, do your parsing, set the publisher's internal document field to your own tree, apply the transforms you want, and then call writer.assemble_parts() to build the parts, from which you can extract what you need.

The only downside is that it's not clear to me whether all of this is still public API, so it might break once in a while. I don't have any long-term experience with docutils, but it seems like a pretty stable project to me, so personally I don't worry about that.
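A minimal sketch of a related route, not the hand-driven Publisher described above: as far as I can tell, publish_parts can be fed an existing tree by pairing the 'doctree' reader with docutils.io.DocTreeInput. The name convert_to_html comes from the question; wrapping the node in a fresh, empty document is my own assumption about how to render a single node, and newer docutils releases may warn that the *_name arguments are deprecated.

```python
import docutils.io
from docutils import nodes, utils
from docutils.core import publish_doctree, publish_parts


def convert_to_html(node):
    """Render one doctree node as an HTML fragment (sketch, see assumptions above)."""
    # Assumption: a throwaway document holding a copy of the node is
    # enough context for the HTML writer.
    wrapper = utils.new_document('<node>')
    wrapper.append(node.deepcopy())
    parts = publish_parts(
        source=wrapper,
        source_class=docutils.io.DocTreeInput,  # hand over a tree, not text
        reader_name='doctree',                  # reader that skips parsing
        writer_name='html',
    )
    return parts['body']


rst = "First\n=====\n\nAlpha text.\n\nSecond\n======\n\nBeta text.\n"
tree = publish_doctree(rst)
sections = [n for n in tree.children if isinstance(n, nodes.section)]
fragments = [convert_to_html(section) for section in sections]
```

Each entry in fragments should then hold the HTML for one top-level section only. Cross-references between sections are not re-resolved by this sketch.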

Old answer

You need publish_parts from docutils.core!

import docutils.io  # imported explicitly so docutils.io.FileInput below resolves
from docutils.core import publish_parts

if __name__ == "__main__":
  # Convert a string:
  parts = publish_parts(
    source = "Hello\n========\n\nThis is my document.",
    writer_name = "html5"
  )
  # Prints only "<p>This is my document.</p>": the lone title becomes
  # the document title, which the "body" part leaves out
  print(parts["body"])

  # To convert a file:
  parts = publish_parts(
    source = None,
    source_path = "path/to/doc.rst",
    source_class = docutils.io.FileInput,
    writer_name = "html5"
  )
  print(parts["body"])

Don't forget that you may still need to do some string substitution before using the parts. For example, as mentioned in the docs, the encoding still needs to be filled in in the output, even if you use the "whole" part.

To see what parts are available have a look at the docs: https://docutils.sourceforge.io/docs/api/publisher.html#publish-parts
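If you would rather check from Python, here is a quick sketch that lists the keys of the returned dictionary. The sample source string is the one from the snippet above; the exact set of keys depends on the writer and the docutils version, so treat the printed list as informational.

```python
from docutils.core import publish_parts

parts = publish_parts(
    source="Hello\n========\n\nThis is my document.",
    writer_name="html5",
)

# publish_parts returns a plain dict, so its keys are the part names.
part_names = sorted(parts)
print(part_names)
```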

Upvotes: 0

QuiteClose

Reputation: 686

My problem was that I was trying to use the docutils package at too low a level. docutils provides a higher-level interface for this sort of thing:

from docutils.core import publish_doctree, publish_from_doctree

rst = '*NB:* just an example.'

# step 1
tree = publish_doctree(rst)

# step 2
# do something with the tree

# step 3
html = publish_from_doctree(tree, writer_name='html').decode()
print(html)

Step one is now much simpler. That said, I'm still slightly dissatisfied with the result; I realise that what I really want is a publish_node function. If you know a better way please do post it.

I should also note that I haven't managed to get this working with Python 3.
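For what it's worth, here is a sketch of what step 2 might look like when the goal is simply to drop part of the document before rendering. The sample text and the "keep only the first section" rule are made up for illustration. Passing output_encoding='unicode' asks docutils for a str instead of bytes, which avoids the .decode() call.

```python
from docutils import nodes
from docutils.core import publish_doctree, publish_from_doctree

rst = "Keep\n====\n\nKept text.\n\nDrop\n====\n\nDropped text.\n"

# step 1
tree = publish_doctree(rst)

# step 2: remove every top-level section except the first (illustrative rule)
top_sections = [child for child in tree.children
                if isinstance(child, nodes.section)]
for section in top_sections[1:]:
    tree.remove(section)

# step 3
html = publish_from_doctree(
    tree,
    writer_name='html',
    settings_overrides={'output_encoding': 'unicode'},
)
```

This still renders a complete HTML document for whatever remains in the tree, not a fragment for an individual node.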

The real lesson

What I was actually trying to do was extract all of the sidebar elements from the doctree so they can be handled separately to the main body of the article. This is not the sort of use case that docutils was intended to solve. Hence no publish_node function.

Once I realised this, the correct approach was simple enough:

  1. Generate the HTML using docutils.
  2. Extract the sidebar elements using BeautifulSoup.

Here's the code that got the job done:

from docutils.core import publish_parts
from bs4 import BeautifulSoup

rst = get_rst_string_from_somewhere()

# get just the body of an HTML document 
html = publish_parts(rst, writer_name='html')['html_body']
soup = BeautifulSoup(html, 'html.parser')

# docutils wraps the body in a div with the .document class
# we can just dispose of that div altogether
wrapper = soup.select('.document')[0]
wrapper.unwrap()

# knowing that docutils gives all sidebar elements the
# .sidebar class makes extracting those elements easy
sidebar = ''.join(tag.extract().prettify() for tag in soup.select('.sidebar'))

# leaving the non-sidebar elements as the document body
body = soup.prettify()

Upvotes: 8
