Reputation: 61

Broken cross-document links with pandoc when converting markdown to other formats

When converting markdown files with cross-document links to html, docx or pdf, the links get broken in the process. I use pandoc 1.19.1 and MiKTeX. This is my test case:

File1: doc1.md
[link1](/doc2.md)
File2: doc2.md
[link2](/doc1.md)

The result in html with this call to pandoc: pandoc doc1.md doc2.md -o test.html looks like this:

<p><a href="/doc2.md">link1</a> <a href="/doc1.md">link2</a></p>

As pdf a link is created but it does not work. Exported as docx it looks the same.

I would have assumed that when multiple files are processed and concatenated into the same output file, the result would contain document-internal links, such as anchor links for html output. Instead, each link is created in the output file exactly as it was in the input files. Even the original file extension .md is preserved in the created links. What am I doing wrong?

My problem looks a bit like this question: "pandoc command line parameters for resolving internal links". In the comments of that question, the bug is said to have been fixed by a pull request in May, but it still seems to exist. Greetings, Georg

Upvotes: 6

Answers (2)

Saaru Lindestøkke

Reputation: 2564

I had a similar problem when trying to export a Gitlab wiki to PDF. There, links between pages look like filename-of-page#anchor-name and links within a page look like #anchor-name. I wrote a (finicky and fragile) pandoc filter that solved that problem for me; perhaps it's useful to others.

Example files

To explain my solution I'll have two test files, 101-first-page.md:

# First page // Gitlab automatically creates an anchor here named #first-page

Some text.

## Another section // Gitlab automatically creates an anchor here named #another-section

A link to the [first section](#first-page)

and 102-second-page.md:

# Second page // Gitlab automatically creates an anchor here named #second-page

Some text and [a link to the first page](101-first-page#first-page).

When concatenating them to render as one document in pandoc, links between pages break because the page name in the link target no longer refers to anything. Below is the concatenated file with the anchors in comments.

# First page // anchor=#first-page

Some text.

## Another section // anchor=#another-section

A link to the [first section](#first-page)

# Second page //  anchor=#second-page

Some text and [a link to the first page](101-first-page#first-page). // <-- this anchor no longer exists.

The link from the second to the first page breaks as the link target is incorrect.

Solution

By pre-processing all markdown files first individually via a pandoc filter, and then concatenating the resulting json files I was able to get all links working.

Requirements

pandoc
latex
python
pandocfilters
Every file should start with a level 1 header that matches the filename (except for the number at the beginning). E.g. the file 101-A file on the wiki.md should start with a level 1 header named A file on the wiki.

Filter

The filter itself (together with the pandoc script) is available in this gist.

What it does is:

It gets the label of the first level 1 header, e.g. first-page
It prepends that label to all other labels in the same file, e.g. first-page-another-section.
It renames all links to the same file such that the prefix is taken into account, e.g. #first-page becomes #first-page-first-page.
It renames all links to other files such that the (assumed) prefix of the other files is taken into account, e.g. 101-first-page#first-page becomes #first-page-first-page.
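
The renaming rules above can be sketched on plain strings. This is a minimal sketch and not the actual filter from the gist: the function names page_prefix, rewrite_heading_id and rewrite_link_target are hypothetical, and it assumes (going by the examples above) that every heading label, including the first one, receives the prefix. A real filter would apply the same logic to the Header and Link elements of the pandoc AST.

```python
import re


def page_prefix(page_name):
    """Derive the assumed anchor prefix from a page name like '101-first-page'."""
    # Assumption: the page name minus its leading number equals the anchor
    # of that page's first level 1 header.
    return re.sub(r"^\d+-", "", page_name)


def rewrite_heading_id(heading_id, prefix):
    """Prefix a heading label so it stays unique after concatenation."""
    # Assumption: the first header is prefixed too, so that links such as
    # '#first-page-first-page' have a target.
    return f"{prefix}-{heading_id}"


def rewrite_link_target(target, current_prefix):
    """Rewrite a wiki link target into an anchor of the concatenated document."""
    if target.startswith("#"):
        # Link within the same page: use the prefix of the current page.
        return f"#{current_prefix}-{target[1:]}"
    page, separator, anchor = target.partition("#")
    if separator and "://" not in page:
        # Link to another wiki page: use the (assumed) prefix of that page.
        return f"#{page_prefix(page)}-{anchor}"
    # External URLs and links without an anchor are left untouched.
    return target
```

With the example files, rewrite_link_target("101-first-page#first-page", "second-page") gives "#first-page-first-page", which matches the label that rewrite_heading_id("first-page", "first-page") assigns to the first header.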

After the script has run every markdown file through this filter individually and converted each one to a json file, it concatenates the json files and converts the result to a PDF.
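
The concatenation step could look roughly like the sketch below. It is not taken from the gist: the function names load_documents and merge_pandoc_json are hypothetical, and it assumes the layout that pandoc -t json produces since pandoc 1.18, namely an object with the keys pandoc-api-version, meta and blocks. The metadata of the first file is kept and the rest is discarded, which is a simplification.

```python
import json


def load_documents(paths):
    """Read pandoc JSON files (as written by `pandoc -t json`) from disk."""
    documents = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            documents.append(json.load(handle))
    return documents


def merge_pandoc_json(documents):
    """Merge pandoc JSON documents by concatenating their top-level blocks."""
    if not documents:
        raise ValueError("need at least one document")
    first = documents[0]
    merged = {
        # Assumption: all inputs come from the same pandoc version.
        "pandoc-api-version": first["pandoc-api-version"],
        # Simplification: only the metadata of the first document is kept.
        "meta": first["meta"],
        "blocks": [],
    }
    for document in documents:
        merged["blocks"].extend(document["blocks"])
    return merged
```

The merged object can be written out with json.dump and passed to pandoc with -f json to produce the final PDF.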

Upvotes: 3

mb21

Reputation: 39488

As the pandoc README states:

If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing.

So as far as pandoc's parser is concerned, it is all one document. You'll therefore have to construct the links in your multiple files as if they were all in one file; see also this answer for details.
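
To link as if everything were in one file, the link target has to be the anchor pandoc assigns to the heading you want to reach, e.g. [link1](#second-page) instead of [link1](/doc2.md). The sketch below approximates the rules the pandoc manual gives for automatic heading identifiers. The function name pandoc_auto_identifier is hypothetical, it only handles plain-text headings (no inline formatting, links or footnotes), and it ignores the numeric suffix pandoc adds to duplicate identifiers.

```python
import re


def pandoc_auto_identifier(heading_text):
    """Approximate pandoc's automatic identifier for a plain-text heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading_text
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Replace runs of whitespace with a single hyphen, then lowercase.
    lowered = re.sub(r"\s+", "-", kept.strip()).lower()
    # Drop everything before the first letter.
    index = 0
    while index < len(lowered) and not lowered[index].isalpha():
        index += 1
    # Fall back to 'section' when nothing is left.
    return lowered[index:] or "section"
```

For a heading "Second page" this yields "second-page", so a link from doc1.md to that heading would be written as [link1](#second-page).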