Michał

Reputation: 2948

Retrieve markdown document structure (with pandoc?)

I have a set of md documents written by a group of people. Each md file is a characterization/description of an object and has identical structure (sections, each section with identical title). You can think of it as a form written in Markdown.

Each file looks something like:

# Object "foobar"

## Color

The object is red with pink dots.    

## Shape

Square-like    

## Texture

Smooth like a glass.

and, say, snafoo.md:

# Object "snafoo"

## Color

The object is green with black stripes.    

## Shape

Ball-like

## Texture

Rough. A bit like sandpaper.

and so on...

I would like to automatically "merge" these files so that the content of matching sections is concatenated. Based on the two files above, I would like to get an output like:

# Color

The object is red with pink dots.    

The object is green with black stripes.    

# Shape

Square-like

Ball-like

# Texture

Smooth like a glass.

Rough. A bit like sandpaper.

What I found useful is to have pandoc convert the md file to DocBook, which is an XML format, so it is easy to parse and to retrieve the structure with existing XML tools. OPML output seems a good candidate too.

I think either docbook or OPML would be acceptable. I would write a script that merges the appropriate sections (appends their content from different documents).

However, pandoc translates all the special characters to HTML entities like &quot;, &amp; and so on. What I would like is to be able to extract, say, specific subsections in the hierarchy but have the text (e.g. the body of a subsection) exactly as it is in the original md file. How would you convert the HTML entities (&quot; etc.) back to plain characters so that everything can be rendered to PDF/DOC/... with pandoc?
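For illustration, here is a minimal Python sketch of the decoding step. It assumes the escaped text is either still inside an XML fragment (where a real XML parser such as `xml.etree.ElementTree` decodes entities like `&quot;` and `&amp;` on its own) or has already been pulled out as a raw string (where `html.unescape` from the standard library does the job). The `<para>` fragment and the sample sentence are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like fragment with escaped characters.
fragment = "<para>Object &quot;foobar&quot; is red &amp; pink.</para>"

# An XML parser decodes the entities while parsing, so .text is plain text.
parsed_text = ET.fromstring(fragment).text

# If the text was extracted without an XML parser (e.g. with a regex),
# html.unescape converts the entities back to ordinary characters.
raw = "Object &quot;foobar&quot; is red &amp; pink."
decoded = html.unescape(raw)

print(parsed_text)
print(decoded)
```

Both calls yield the same plain string, so the escaping only matters if the XML is processed as raw text rather than through a parser.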

Upvotes: 2

Views: 648

Answers (2)

z--

Reputation: 2236

Two other candidates for output formats to check might be

  • JSON, use one of the many JSON viewers
  • OPML (Outline Processor Markup Language) an XML format
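As a rough sketch of the JSON route in Python: `pandoc -t json` emits the document as a JSON syntax tree in which text is stored as ordinary strings, so no HTML entities are involved. The code below groups blocks under each level-2 heading across several documents and emits the merged headings at level 1, as in the desired output. The function names (`inline_text`, `merge_sections`, `merge_files`) are made up for this example, and it assumes every file uses the same level-2 titles and that a `pandoc` executable is on the PATH.

```python
import json
import subprocess


def inline_text(node):
    """Flatten pandoc inline nodes to plain text (used only as a section key)."""
    if isinstance(node, list):
        return "".join(inline_text(n) for n in node)
    if isinstance(node, dict):
        if node.get("t") == "Str":
            return node["c"]
        if node.get("t") in ("Space", "SoftBreak"):
            return " "
        return inline_text(node.get("c", []))
    return ""


def merge_sections(docs, level=2):
    """Concatenate the bodies of same-titled sections from parsed pandoc ASTs."""
    merged = {}  # dicts keep insertion order, so sections stay in first-seen order
    for doc in docs:
        current = None
        for block in doc["blocks"]:
            if block["t"] == "Header" and block["c"][0] <= level:
                if block["c"][0] == level:
                    title = inline_text(block["c"][2])
                    current = merged.setdefault(title, {"header": block, "body": []})
                else:
                    # A higher-level heading (the "# Object ..." title) ends the section.
                    current = None
            elif current is not None:
                current["body"].append(block)
    blocks = []
    for section in merged.values():
        _, attr, inlines = section["header"]["c"]
        blocks.append({"t": "Header", "c": [1, attr, inlines]})  # promote to level 1
        blocks.extend(section["body"])
    return {"pandoc-api-version": docs[0]["pandoc-api-version"],
            "meta": {}, "blocks": blocks}


def merge_files(paths):
    """Read Markdown files via pandoc, merge them, and return Markdown text."""
    docs = []
    for path in paths:
        out = subprocess.run(["pandoc", "-f", "markdown", "-t", "json", path],
                             check=True, capture_output=True, text=True)
        docs.append(json.loads(out.stdout))
    result = subprocess.run(["pandoc", "-f", "json", "-t", "markdown"],
                            input=json.dumps(merge_sections(docs)),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `merge_files(["foobar.md", "snafoo.md"])` would return the merged Markdown (the first file name is a guess, since only snafoo.md is named in the question). Note that pandoc's Markdown writer may normalize whitespace or line wrapping, so the body text is preserved in content but not necessarily byte for byte.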

Upvotes: 3

Vincent Goossens

Reputation: 171

Pandoc is written in Haskell and uses a Haskell data structure as an intermediate format for converting documents. You can choose native as the output format (pandoc -t native) to get that data structure. Haskell data structures are very easy to parse. You can also just use Haskell and read the data structure using the types from the pandoc package, or use the document reader functions from pandoc directly.

Upvotes: 3
