Reputation: 2948
I have a set of md documents written by a group of people. Each md file is a characterization/description of an object and has an identical structure (the same sections, each with the same title). You can think of it as a form written in Markdown.
Each file looks something like:
# Object "foobar"
## Color
The object is red with pink dots.
## Shape
Square-like
## Texture
Smooth like a glass.
and, say, snafoo.md:
# Object "snafoo"
## Color
The object is green with black stripes.
## Shape
Ball-like
## Texture
Rough. A bit like sandpaper.
and so on...
I would like to automatically "merge" these files so that the content of matching sections is concatenated. Based on the two files above, I would like to get an output like:
# Color
The object is red with pink dots.
The object is green with black stripes.
# Shape
Square-like
Ball-like
# Texture
Smooth like a glass.
Rough. A bit like sandpaper.
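For files as regular as the ones above, the merge can also be sketched directly on the Markdown text, with no intermediate format. The following is a minimal sketch, not a general Markdown parser: it assumes every section starts with a line of the form `## Title`, ignores everything before the first such line (including the `# Object ...` title), and does not handle `## ` lines inside code blocks. The function names `split_sections` and `merge_documents` are made up for this illustration.

```python
import re


def split_sections(markdown_text):
    """Map each '## Title' heading to the body text that follows it."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            # A new level-2 heading starts a new section.
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            # Body lines are kept verbatim, so no characters are escaped.
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def merge_documents(documents):
    """Concatenate the bodies of equally named sections across documents."""
    merged = {}
    for text in documents:
        for title, body in split_sections(text).items():
            merged.setdefault(title, []).append(body)
    parts = []
    for title, bodies in merged.items():
        parts.append("# " + title)
        parts.extend(body for body in bodies if body)
    return "\n".join(parts) + "\n"
```

Calling `merge_documents` with the contents of the two example files, in that order, yields the merged output shown above. Sections appear in the order in which they are first seen.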
What I found to be useful is to use pandoc to convert the md file to DocBook format, which is essentially an XML format, so it is easy to parse it and retrieve the structure using existing XML tools. OPML output seems a good candidate too.
I think either docbook or OPML would be acceptable. I would write a script that merges the appropriate sections (appends their content from different documents).
However, pandoc translates all the special characters to HTML codes like &quot;, &amp; and so on. What I would like is to be able to extract, say, specific subsections in the hierarchy but have the text (e.g. the body of a subsection) exactly as it is in the original md file. How would you convert the HTML codes (&quot; etc.) back to ASCII so that everything can be rendered to PDF/DOC/... with pandoc?
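As an illustration of the decoding step being asked about, here is a minimal sketch using only the Python standard library. It assumes the text has already been extracted as a plain string; `restore_characters` and `element_text` are hypothetical helper names. Note that a conforming XML parser decodes the predefined entities on its own, so text read through one already comes back unescaped.

```python
import html
import xml.etree.ElementTree as ET


def restore_characters(text):
    """Turn entities such as &quot; and &amp; back into plain characters."""
    return html.unescape(text)


def element_text(xml_fragment):
    """Parse a small XML fragment and return its text, entities decoded."""
    # The XML parser resolves &quot;, &amp;, &lt;, &gt; and &apos; itself.
    return ET.fromstring(xml_fragment).text
```

For example, `restore_characters('Object &quot;foobar&quot; &amp; more')` gives the string `Object "foobar" & more`, and so does `element_text('<para>Object &quot;foobar&quot; &amp; more</para>')`.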
Upvotes: 2
Views: 648
Reputation: 2236
Two other candidates for output formats to check might be
Upvotes: 3
Reputation: 171
Pandoc is written in Haskell and uses a Haskell data structure as an intermediate format for converting documents. You can choose native as the output format to get that data structure. Haskell data structures are very easy to parse. You can also just use Haskell and read the data structure using the types from the pandoc package, or use the document reader functions from pandoc directly.
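If Haskell is not an option, the same intermediate structure can also be written as JSON with `pandoc -t json`, which is a swapped-in alternative to the native output described above. The sketch below walks such a JSON document in Python. It assumes the layout as I understand it: a top-level `blocks` list whose entries carry a type tag `t` and contents `c`, with `Header` contents of the form `[level, attributes, inlines]`. That layout may differ between pandoc versions, and only `Str`, `Space` and `SoftBreak` inlines are handled here. The function names are made up for this illustration.

```python
import json


def load_ast(json_text):
    """Parse the JSON text produced by `pandoc -t json`."""
    return json.loads(json_text)


def inlines_to_text(inlines):
    """Flatten a list of inline nodes into plain text."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        # Other inline types (Emph, Quoted, ...) are skipped in this sketch.
    return "".join(parts)


def sections_from_ast(doc, level=2):
    """Group paragraph text under the headers of the given level."""
    sections = {}
    current = None
    for block in doc["blocks"]:
        if block["t"] == "Header" and block["c"][0] == level:
            current = inlines_to_text(block["c"][2])
            sections.setdefault(current, [])
        elif block["t"] == "Para" and current is not None:
            sections[current].append(inlines_to_text(block["c"]))
    return sections
```

The resulting dictionaries from several files can then be merged by title. Because the text comes from `Str` nodes rather than from an XML serialization, no entity decoding is needed.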
Upvotes: 3