Retrieve markdown document structure (with pandoc?)

I have a set of md documents written by a group of people. Each md file is a characterization/description of an object and has identical structure (sections, each section with identical title). You can think of it as a form written in Markdown.

Each file looks something like:

# Object "foobar"

## Color

The object is red with pink dots.    

## Shape

Square-like    

## Texture

Smooth like a glass.

and, say, snafoo.md:

# Object "snafoo"

## Color

The object is green with black stripes.    

## Shape

Ball-like

## Texture

Rough. A bit like sandpaper.

and so on...

I would like to automatically "merge" these files so that the content of matching sections is concatenated. Based on the two files above, I would like to get output like:

# Color

The object is red with pink dots.    

The object is green with black stripes.    

# Shape

Square-like

Ball-like

# Texture

Smooth like a glass.

Rough. A bit like sandpaper.
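
For illustration, here is a minimal sketch of that merge in plain Python, without pandoc. It assumes every section starts with an ATX heading of the form `## Title` at the start of a line, that titles match exactly across files, and that no fenced code block contains a line starting with `## `. The names `split_sections` and `merge_documents` are made up for this example.

```python
def split_sections(text, level=2):
    """Map each heading title at `level` to the body text that follows it."""
    marker = "#" * level + " "
    sections = {}
    title = None
    for line in text.splitlines():
        if line.startswith(marker):
            # A new section starts; remember its title.
            title = line[len(marker):].strip()
            sections.setdefault(title, [])
        elif title is not None:
            # Lines before the first matching heading (e.g. the "# Object" title) are skipped.
            sections[title].append(line)
    return {t: "\n".join(body).strip() for t, body in sections.items()}


def merge_documents(docs):
    """Concatenate the bodies of same-titled sections across several documents."""
    merged = {}
    for text in docs:
        for title, body in split_sections(text).items():
            merged.setdefault(title, []).append(body)
    parts = []
    for title, bodies in merged.items():
        parts.append("# " + title)  # promoted to a top-level heading, as in the desired output
        parts.extend(bodies)
    return "\n\n".join(parts) + "\n"
```

Calling `merge_documents` with the contents of foobar.md and snafoo.md should produce the three merged sections in the order they first appear. Because the body text is copied verbatim (apart from stripping surrounding whitespace), no characters are escaped along the way.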

What I have found useful is to use pandoc to convert the md files to DocBook, which is an XML format, so it is easy to parse and to retrieve the structure with existing XML tools. OPML output seems a good candidate too.

I think either docbook or OPML would be acceptable. I would write a script that merges the appropriate sections (appends their content from different documents).

However, pandoc translates all the special characters to entity codes such as &quot;, &amp; and so on. What I would like is to be able to extract, say, specific subsections in the hierarchy but have the text (e.g. the body of a subsection) exactly as it is in the original md file. How would you convert the entity codes (&quot; etc.) back to plain characters so that everything can be rendered to PDF/DOC/... with pandoc?
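
As a rough sketch of the decoding step: an XML parser such as Python's `xml.etree.ElementTree` already turns entities back into plain characters when the text is read, and `html.unescape` does the same for a raw string. The XML fragment below is a hypothetical, namespace-free stand-in for DocBook output; the element names pandoc actually emits may differ (DocBook 5, for instance, uses a namespace), and `section_texts` is a name invented for this example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like fragment (an assumption, not real pandoc output).
SAMPLE = """<article>
  <section>
    <title>Color</title>
    <para>The object is &quot;red&quot; &amp; pink.</para>
  </section>
  <section>
    <title>Shape</title>
    <para>Square-like</para>
  </section>
</article>"""


def section_texts(xml_source):
    """Map each section title to a list of its paragraph texts, entities decoded."""
    root = ET.fromstring(xml_source)
    result = {}
    for section in root.iter("section"):
        title = section.findtext("title", default="").strip()
        # itertext() flattens inline elements, so inline markup such as emphasis is lost.
        paras = ["".join(p.itertext()).strip() for p in section.findall("para")]
        result[title] = paras
    return result


# Decoding a raw string that still contains entity codes.
decoded = html.unescape("&quot;foo&quot; &amp; bar")
```

Note that this recovers the characters but not the original Markdown syntax: emphasis, links, and similar inline markup would still need to be converted back separately.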

