How can I convert a Markdown string to a DocX in Python?

Question

I am getting markdown text from my API like this:

{
    name:'Onur',
    surname:'Gule',
    biography:'## Computers
    I like **computers** so much.
    I wanna *be* a computer.',
    membership:1
}

The biography field contains a Markdown string like the one above:

## Computers
I like **computers** so much.
I wanna *be* a computer.

I want to take this Markdown text and convert it to DOCX content for my reports.

In my docx template:

{{markdownText|mark2html}}

{{simpleText}}

I am using the Python 3 docxtpl package to create the DOCX file, and it works for simple text.

I tried BeautifulSoup to convert the Markdown to DOCX text, but it doesn't preserve styles (bold, italic, etc.).
I tried pandoc and it worked, but it only creates a new DOCX file; I want to add the rendered Markdown text to an existing DOCX file (while it is being created).

My current code:

import docx
from docxtpl import DocxTemplate, RichText
import markdown
import jinja2
import markupsafe
from bs4 import BeautifulSoup
import pypandoc

def safe_markdown(text):
    return markupsafe.Markup(markdown.markdown(text))

def mark2html(value):
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    output = pypandoc.convert_text(value,'rtf',format='md')
    return RichText(value) #tried soup and pandoc..

def from_template(template):
    template = DocxTemplate(template)
    context = {
        'simpleText':'Simple text test.',
        'markdownText':'Markdown **text** test.'
    } 
    jenv = jinja2.Environment()
    jenv.filters['markdown'] = safe_markdown
    jenv.filters["mark2html"] = mark2html
    template.render(context,jenv)
    template.save('new_report.docx')

So, how can I add rendered Markdown to an existing DOCX file, or while creating one, maybe with a jinja2 filter?

Onurgule · Accepted Answer

I solved it without any shortcuts. I convert the Markdown to HTML, parse it with BeautifulSoup, and then process every paragraph by checking the tag names of its children.

In my word template:

{% if markdownText != None %}
    {% for mt in markdownText|mark2html %} 
        {{mt}}
    {% endfor %}
{% endif %}

My template tag:

def mark2html(value):
    if value is None:
        return '-'
    # Render the Markdown to HTML, then parse it so each block tag can be inspected.
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    paragraphs = []
    for tag in soup.find_all(True):
        # Only paragraphs and headings are converted.
        if tag.name in ('p','h1','h2','h3','h4','h5','h6'):
            paragraphs.extend(parseHtmlToDoc(tag))
    return paragraphs

My code that converts each tag into docx content:

from bs4.element import Tag
from docxtpl import InlineImage, RichText

def parseHtmlToDoc(org_tag):
    pars = []
    for con in org_tag.contents:
        if isinstance(con, Tag):
            tag = con
            if tag.name in ('strong', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                if pars and isinstance(pars[-1], RichText):
                    # Extend the RichText that is already in progress.
                    pars[-1].add(con.contents[0], bold=True)
                else:
                    source = RichText("")
                    source.add(con.contents[0], bold=True)
                    pars.append(source)
            elif tag.name == 'img':
                # `doc` is the module-level DocxTemplate and `settings` is assumed
                # to be Django's settings; neither is defined in this snippet.
                source = tag['src']
                imagen = InlineImage(doc, settings.MEDIA_ROOT + source)
                pars.append(imagen)
            elif tag.name == 'em':
                source = RichText("")
                source.add(con.contents[0], italic=True)
                pars.append(source)
        else:
            # Plain text node.
            if pars and isinstance(pars[-1], RichText):
                # Extend the RichText that is already in progress.
                pars[-1].add(con)
            else:
                source = RichText("")
                if org_tag.name == 'h2':
                    source.add(con, bold=True, size=40)
                else:
                    source.add(con)
                pars.append(source)  # always append?
    return pars

It processes HTML tags such as strong, em, img, and the headings. You can add more tags to process. I solved it this way, and it doesn't need any additional file conversion such as html2docx.
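
To illustrate how more tags could be added, here is a minimal sketch of the same tag-to-style idea written as a lookup table. It is not the code from this answer: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and `STYLE_TAGS`, `RunCollector`, and `html_to_runs` are hypothetical names. It only collects `(text, style)` pairs; headings are made bold, as in the code above.

```python
from html.parser import HTMLParser

# Hypothetical table: HTML tag name -> style flag it switches on.
# Supporting another inline tag means adding one entry here.
STYLE_TAGS = {"strong": "bold", "b": "bold", "em": "italic", "i": "italic"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class RunCollector(HTMLParser):
    """Collect (text, style) runs, one list per <p> or heading."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one list of runs per block element
        self.active = []       # stack of style flags currently open
        self.in_block = False  # True while inside a <p> or heading

    def handle_starttag(self, tag, attrs):
        if tag == "p" or tag in HEADING_TAGS:
            self.blocks.append([])
            self.in_block = True
            if tag in HEADING_TAGS:
                self.active.append("bold")
        elif tag in STYLE_TAGS:
            self.active.append(STYLE_TAGS[tag])

    def handle_endtag(self, tag):
        if tag == "p" or tag in HEADING_TAGS:
            self.in_block = False
        if (tag in STYLE_TAGS or tag in HEADING_TAGS) and self.active:
            self.active.pop()

    def handle_data(self, data):
        # Text outside a block (e.g. the newline between blocks) is ignored.
        if self.in_block and data:
            style = {flag: True for flag in self.active}
            self.blocks[-1].append((data, style))


def html_to_runs(html):
    """Return a list of blocks, each a list of (text, style_dict) runs."""
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = "<h2>Computers</h2>\n<p>I like <strong>computers</strong> so much.</p>"
    for block in html_to_runs(sample):
        print(block)
```

Each style dict uses the keys `bold` and `italic`, so a run could be passed on with something like `rich_text.add(text, **style)` on a docxtpl `RichText`.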

I used this process in my code like this:

report_context = {'reportVariables': report_variables}
template = DocxTemplate('report_format.docx')
jenv = jinja2.Environment()
jenv.filters["mark2html"] = mark2html
template.render(report_context,jenv)
template.save('exported_1.docx')
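
Note that the image branch of parseHtmlToDoc reads the template from a module-level `doc`, which the snippet above does not assign. One possible way to avoid the global, sketched here under the assumption that the converter accepts the template as a second argument (a hypothetical signature, not the one used in this answer), is to build the filter in a closure:

```python
def make_mark2html(template, convert):
    """Return a Jinja2-style filter bound to `template`.

    `convert(value, template)` is a hypothetical callable that turns a
    Markdown string into a list of docx items (RichText, InlineImage, ...).
    """
    def mark2html(value):
        if value is None:
            # An empty list lets a template `for` loop render nothing.
            return []
        return convert(value, template)
    return mark2html
```

The filter would then be registered with something like `jenv.filters["mark2html"] = make_mark2html(template, convert)` before calling `template.render(...)`.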
