Reputation: 866
I have some Calibre-created EPUBs that I want to convert to Markdown for use in Obsidian. I found Pandoc, but my simple attempts at conversion are, among other things, losing the italics and passing the Calibre span tags through, which don't show as italics in Obsidian.
If I turn off the raw_html extension, it no longer passes the span tags through, but I don't get any italics either. What I want to do is convert the HTML:
<span class="italic">Some Words</span>
into italic text in my final Markdown file. If Pandoc can do this, that would be great. Otherwise I'll take a swipe at converting the HTML before passing it into Pandoc, but a lot of the span tags that Calibre generated are stacked a few layers deep, so a really simple solution would be great.
Does Pandoc handle this directly, or do I need to deal with the HTML first? I'm not concerned with italics only: there are a bunch of other formatting issues involving various Calibre span tags that could be simpler, like bold and some headings. So I'm trying to work out a way to deal with them all.
UPDATE:
Since I had to do a lot of poking around the web and trial and error to get this working well enough for my needs, I thought it would be useful to post my first try at a lua filter. This has worked well on a few hundred html pages extracted (via unzip) from EPUBs seemingly authored by a tool called Calibre.
My ebooks are wordy references that are light on images, so my Image handler is simple enough for my needs. And I can grep the resulting files and fix any link issues. My set of Calibre EPUBs changed their internal structure depending on when they were created, so I keep adding new class names since there seemed to be no effort to maintain the same naming over time. Therefore I expect to modify this over time as I convert more of the files. I also expect markdown will be a better format going forward than EPUBs!
This is my first ever Lua code, so I expect it is a bit sloppy. Also, I used --wrap=none -t commonmark in a bash script that iterates a pandoc transformation over all the files in my work directory.
-- Corrections for some Calibre oddities when using Pandoc to convert to markdown for Obsidian.
-- Note that I am converting very old ebooks and that I don't know anything about Calibre.
function Span (span)
  -- Make italic for: <span class="italic">(“Hello there!”)</span>
  if span.classes:includes 'italic' then
    return pandoc.Emph(span.content)
  end
  -- Make bold for: <span class="bold">(“Hello there!”)</span>
  if span.classes:includes 'bold' then
    return pandoc.Strong(span.content)
  end
  -- Unclear what purpose these serve...
  if span.classes:includes 'calibre1' or span.classes:includes 'calibre2'
     or span.classes:includes 'calibre3' or span.classes:includes 'calibre4' then
    return pandoc.Strong(span.content)
  end
  -- My markdown reader (Obsidian) works with this when using commonmark output.
  if span.classes:includes 'underline' then
    span.attributes['style'] = 'text-decoration: double underline ;'
    return span
  end
end

function Image (img)
  -- Fix calibre6 images.
  if img.classes:includes 'calibre6' or img.classes:includes 'calibre9' then
    return pandoc.Image(img.caption, img.src, nil, nil)
  end
end
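function Div (div)
  -- Put a horizontal line in for the page break, just to see where they are.
  -- A HorizontalRule block is returned instead of the string '---', so the
  -- writer emits a real rule rather than treating the dashes as text.
  if div.classes:includes 'mbp_pagebreak' then
    return pandoc.HorizontalRule()
  end
  -- These seem to be hardcoded page delimiters put in by calibre for ebook readers?
  if div.classes:includes 'calibre_4' or div.classes:includes 'calibre_13' then
    return pandoc.HorizontalRule()
  end
end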
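For spans whose classes carry no formatting meaning, one alternative to mapping them to bold is to unwrap them so that only their contents survive. The following is a minimal sketch, assuming Calibre's generated classes look like calibre, calibre7 or calibre_13; that pattern and the helper names is_calibre_class and only_calibre_classes are illustrative assumptions, not anything Pandoc or Calibre defines.

```lua
-- Assumption: Calibre-generated class names match "calibre", "calibre7",
-- "calibre_13" and similar. Adjust the pattern to what your files contain.
local function is_calibre_class(class)
  return class:match('^calibre_?%d*$') ~= nil
end

-- True only when there is at least one class and every class is
-- a Calibre-generated one (hypothetical helper).
local function only_calibre_classes(classes)
  if #classes == 0 then
    return false
  end
  for _, class in ipairs(classes) do
    if not is_calibre_class(class) then
      return false
    end
  end
  return true
end

function Span(span)
  if only_calibre_classes(span.classes) then
    -- Returning the list of inlines replaces the span with its contents.
    return span.content
  end
  -- Returning nothing leaves the span unchanged.
end
```

A filter file can define only one global Span function, so to combine this with the checks above, the unwrap test would go inside the existing function, after the italic, bold and underline cases.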
Finally, I should note that I wrote a script that renames all of the new markdown files by their first line. This put them in a coherent shape for adding to Obsidian, which uses the filesystem for organizing things. After a little editing and renaming I have a section of old reference books in my Obsidian vault that are easy to access on all my devices.
Upvotes: 2
Views: 567
Reputation: 38441
I don't know Pandoc, but that is some bad HTML. And using better HTML may help you with your problem.
HTML should express the semantic meaning of the content, and
<span class="italic">Some Words</span>
doesn't express any semantic meaning, which is probably why Pandoc doesn't know what to do with it.
For one, class names should express why something is formatted the way it is, not how it is formatted. For example, better class names could be "important" or "book-title" (because book titles are often formatted in italics).
Furthermore, the element (tag) span also doesn't express any meaning. But there is an element that (basically) means "important": <em>. So instead of <span class="important">Some words</span> it would be better to use <em>Some Words</em>.
Coming back to Pandoc: if the reason you are using italics is that the text is important, then you should use <em>, and because <em> is normally rendered as italic, Pandoc may actually know to use italics, too. EDIT: Based on the other answer, Pandoc has the concept of emphasis (pandoc.Emph), so I'm quite sure it will be rendered either as italic or at least as something else suitable.
There are more elements that are normally rendered as italic, so you could use those, too, if they are semantically correct for your usage, for example <cite> (which can be used for book titles, as in my other example) or the more generic <i>.
Upvotes: 1
Reputation: 22659
Pandoc does not parse CSS and hence has no way to know that this should be put into italics. A good solution is to modify pandoc's internal document representation using a Lua filter.
function Span (span)
  if span.classes:includes 'italic' then
    return pandoc.Emph(span.content)
  end
end
This filter checks whether the span has the class italic and, if it does, converts it into emphasized text, which will usually be output in italics. Use the filter by saving it to a file and passing that file to pandoc via the --lua-filter command line option.
You'll likely want to handle more classes; other pandoc constructors you might want to use include pandoc.Strong and pandoc.Underline. Run pandoc with --to=native to see how pandoc represents the document internally.
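When several classes need handling, a lookup table keeps the filter short. The sketch below is one possible way to do it, not the only one; the class names in the table and the helper constructors_for are assumptions for illustration. Pandoc applies the Span function to every span in the document, however deeply nested, so stacked spans need no special handling.

```lua
-- Assumed mapping from class names to pandoc constructor names.
-- Check your own files with --to=native and adjust as needed.
local CLASS_TO_CONSTRUCTOR = {
  italic    = 'Emph',
  bold      = 'Strong',
  underline = 'Underline',
}

-- Collect the constructor names for all classes that have a mapping,
-- in the order the classes appear (hypothetical helper).
local function constructors_for(classes)
  local names = {}
  for _, class in ipairs(classes) do
    local name = CLASS_TO_CONSTRUCTOR[class]
    if name then
      names[#names + 1] = name
    end
  end
  return names
end

function Span(span)
  local names = constructors_for(span.classes)
  if #names == 0 then
    return nil  -- no known class: leave the span unchanged
  end
  local content = span.content
  for _, name in ipairs(names) do
    -- Wrap what we have so far; the braces make a one-element inline list.
    content = { pandoc[name](content) }
  end
  return content
end
```

Saved under a name such as calibre.lua (a hypothetical filename), the filter would be run with pandoc --lua-filter=calibre.lua -t commonmark input.html -o output.md. How Underline is rendered depends on the output format, so it is worth checking the result in the target Markdown reader.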
Upvotes: 1