What to do with odd span tags when using Pandoc to make markdown output

Question

I have some Calibre-created EPUBs that I want to convert to markdown for use in Obsidian. I found Pandoc, but my simple attempts at conversion are, among other things, losing the italics and passing the Calibre span tags through, which don't render as italics in Obsidian.

If I turn off the raw_html extension it doesn't pass all the span tags through, but I don't get any italics, either. What I want to do is convert the html:

<span class="italic">Some Words</span>

into italic text in my final markdown file. If Pandoc can do this, that would be great. Otherwise I'll take a swipe at converting the html before passing it into Pandoc, but a lot of the span tags that Calibre generated are stacked a few layers deep, so a really simple solution would be great.

Does Pandoc handle this directly or do I need to deal with the html first? I'm not concerned with italics only; a bunch of other formatting, like bold and some headings, is expressed through various Calibre span tags and could be simpler. So I'm trying to work out a way to deal with them all.

UPDATE:

Since I had to do a lot of poking around the web and trial and error to get this working well enough for my needs, I thought it would be useful to post my first try at a Lua filter. This has worked well on a few hundred html pages extracted (via unzip) from EPUBs seemingly authored by a tool called Calibre.

My ebooks are wordy references that are light on images, so my Image handler is simple enough for my needs. And I can grep the resulting files and fix any link issues. My set of Calibre EPUBs changed their internal structure depending on when they were created, so I keep adding new class names since there seemed to be no effort to maintain the same naming over time. Therefore I expect to modify this over time as I convert more of the files. I also expect markdown will be a better format going forward than EPUBs!

This is my first ever Lua code, so I expect it is a bit sloppy. Also, I used --wrap=none -t commonmark in a bash script that iterates a pandoc transformation over all the files in my work directory.

-- Corrections for some Calibre oddities when using Pandoc to convert to markdown for Obsidian.
-- Note that I am converting very old ebooks and that I don't know anything about Calibre.
function Span (span)
    -- Make italic for spans such as: <span class="italic">“Hello there!”</span>
    if span.classes:includes 'italic' then
        return pandoc.Emph(span.content)
    end

    -- Make bold for spans such as: <span class="bold">“Hello there!”</span>
    if span.classes:includes 'bold' then
        return pandoc.Strong(span.content)
    end

    -- Unclear what purpose these serve...
    if span.classes:includes 'calibre1' or span.classes:includes 'calibre2'
            or span.classes:includes 'calibre3' or span.classes:includes 'calibre4' then
        return pandoc.Strong(span.content)
    end

    -- My markdown reader (Obsidian) works with this when using commonmark output.
    if span.classes:includes 'underline' then
        span.attributes['style'] = 'text-decoration: double underline ;'
        return span
    end
end

function Image (img)
    -- Fix calibre6 images.
    if img.classes:includes 'calibre6' or img.classes:includes 'calibre9' then
        return pandoc.Image(img.caption, img.src, nil, nil)
    end
end

function Div (div)
    -- Put a horizontal line in for the page break, just to see where they are.
    -- A HorizontalRule block is written as a thematic break (---) in markdown output.
    if div.classes:includes 'mbp_pagebreak' then
        return pandoc.HorizontalRule()
    end
    -- These seem to be hardcoded page delimiters put in by calibre for ebook readers?
    if div.classes:includes 'calibre_4' or div.classes:includes 'calibre_13' then
        return pandoc.HorizontalRule()
    end
end
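
Since the class names keep changing from book to book, one way to keep the growing list manageable might be a lookup table instead of chained `or` checks. The following is only a sketch of the Div handler in that style: `pagebreak_classes` and `has_any_class` are made-up names, the class list is the one from the filter above, and it returns `pandoc.HorizontalRule()`, pandoc's block element for a horizontal line.

```lua
-- Sketch: keep the page-break class names in one table so that new
-- names can be added in a single place. The table and helper names
-- are hypothetical; the class names come from the filter above.
local pagebreak_classes = {
  mbp_pagebreak = true,
  calibre_4 = true,
  calibre_13 = true,
}

-- Return true if any entry of the array-like table `classes`
-- is a key in the set `wanted`.
local function has_any_class (classes, wanted)
  for _, class in ipairs(classes) do
    if wanted[class] then
      return true
    end
  end
  return false
end

function Div (div)
  if has_any_class(div.classes, pagebreak_classes) then
    return pandoc.HorizontalRule()
  end
  -- Returning nil leaves every other div unchanged.
end
```

Adding a newly discovered class then means adding one line to the table rather than extending a condition.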

Finally, I should note that I wrote a script that renames all of the new markdown files by their first line. This put them in a coherent shape for adding to Obsidian, which uses the filesystem for organizing things. After a little editing and renaming I have a section of old reference books in my Obsidian vault that are easy to access on all my devices.

tarleb · Accepted Answer

Pandoc does not parse CSS and hence has no way to know that this should be put into italics. A good solution is to modify pandoc's internal document representation using a Lua filter.

function Span (span)
  if span.classes:includes 'italic' then
    return pandoc.Emph(span.content)
  end
end

This filter checks if the span has class italic and, if it does, converts it into emphasized text, which will usually be output in italics. Use the filter by saving it to a file and passing that file to pandoc via the --lua-filter command line option.

You'll likely want to handle more classes; other pandoc constructors you might want to use include pandoc.Strong and pandoc.Underline. Run pandoc with --to=native to see how pandoc represents the document internally.
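
To handle several classes at once, the filter above could be generalized with a table that maps class names to constructors. This is a minimal sketch, not a definitive implementation: the class names `bold` and `underline` are assumptions about what the HTML uses, `constructors` and `constructor_for` are made-up names, and `pandoc.Underline` requires a reasonably recent pandoc version.

```lua
-- Sketch: map CSS class names to pandoc inline constructors.
-- The class names are assumptions; adjust them to the classes
-- that actually occur in your HTML.
local constructors = {
  italic = function (content) return pandoc.Emph(content) end,
  bold = function (content) return pandoc.Strong(content) end,
  underline = function (content) return pandoc.Underline(content) end,
}

-- Return the constructor for the first class that has a mapping,
-- or nil if none of the classes is known.
local function constructor_for (classes)
  for _, class in ipairs(classes) do
    if constructors[class] then
      return constructors[class]
    end
  end
  return nil
end

function Span (span)
  local make = constructor_for(span.classes)
  if make then
    return make(span.content)
  end
  -- Returning nil leaves spans with unknown classes unchanged.
end
```

Only the first matching class is applied, so a span carrying both `italic` and `bold` would need extra handling. Pandoc calls the Span function for every span in the document, nested ones included, so spans stacked a few layers deep should each be converted.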
