Pandoc - Images in Word file are not extracted into media folder at the time of the filter execution

Question

I've got some MS Word files (docx) that I convert into markdown files. Later, those markdown files get converted into PDF and HTML files. All of the conversions are done with pandoc.

When a Word file is converted into markdown, my Python pandoc filter needs to get the width and height (in inches) of each image from the AST. This is working fine; I'm able to get this information from the AST:

{
    "t": "Image",
    "c": [
        [
            "",
            [],
            [
                ["width", "5.113165354330708in"],
                ["height", "3.063299212598425in"]
            ]
        ],
        [],
        ["media/image1.png", ""]
    ]
}
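For reference, here is a minimal sketch of how the inch values could be pulled out of a node shaped like the one above, using only the standard library. The helper name image_size_inches is hypothetical, and the sketch assumes both attributes use the "in" unit as in this example; any other unit is returned as None rather than converted.

```python
import json


def image_size_inches(node):
    """Return (width, height) in inches for a pandoc Image node.

    Each value is None when the attribute is missing or not given in inches.
    """
    if node.get("t") != "Image":
        raise ValueError("not an Image node")

    attr = node["c"][0]        # [identifier, classes, key-value pairs]
    pairs = dict(attr[2])      # e.g. {"width": "5.11in", "height": "3.06in"}

    def inches(value):
        # Assumption: only the "in" unit is handled in this sketch.
        if value is None or not value.endswith("in"):
            return None
        return float(value[:-2])

    return inches(pairs.get("width")), inches(pairs.get("height"))


node = json.loads("""
{
    "t": "Image",
    "c": [
        ["", [], [["width", "5.113165354330708in"],
                  ["height", "3.063299212598425in"]]],
        [],
        ["media/image1.png", ""]
    ]
}
""")

width_in, height_in = image_size_inches(node)
print(width_in, height_in)
```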

But the filter also needs to open the actual image with the Pillow library and read its size (in pixels) and DPI from the file system for some calculations.
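A rough sketch of that step, assuming Pillow is installed and the image file already exists on disk. The function names are hypothetical, and since the exact calculations are not spelled out here, the pixels-per-inch ratio is only one plausible example. Pillow exposes the pixel size as Image.size and, when the file carries it, the DPI under the "dpi" key of Image.info.

```python
def read_pixels_and_dpi(path):
    """Return ((width_px, height_px), dpi) for an image file.

    dpi is None when the file does not store that information.
    """
    # Imported here so the rest of the sketch works without Pillow installed.
    from PIL import Image

    with Image.open(path) as img:
        return img.size, img.info.get("dpi")


def pixels_per_inch(size_px, size_in):
    """Effective resolution of an image shown at size_in inches."""
    if size_in <= 0:
        raise ValueError("size in inches must be positive")
    return size_px / size_in


# Hypothetical numbers: a 960 px wide image placed at 10 inches wide.
print(pixels_per_inch(960, 10.0))
```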

The problem is that when my filter, which runs during the docx to markdown conversion, tries to open the image with Pillow while building the markdown image link, it fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/mertcan.segmen/Desktop/doc/media/image1.png'

This probably means that pandoc does not extract the images from the Word file before executing the filter. Is this normal? If not, any advice on how I can achieve what I have in mind?

Mertcan Seğmen · Accepted Answer

I found a workaround: I run pandoc --extract-media ./ MyDocxFile.docx right before converting my docx to markdown. This first run is there only to extract the images from the docx file into the media folder; after that I run my usual pandoc command for the conversion. Since the images have already been extracted, my filter has access to them.
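The two-step workaround could be scripted roughly like this. It is a sketch, not the exact commands I use: the function names, my_filter.py and out.md are placeholders, and it assumes pandoc is on the PATH. The first call discards the converted text that pandoc prints, because that run is only needed for the extracted images.

```python
import subprocess


def build_commands(docx, markdown_out, filter_script, media_parent="."):
    """Build the two pandoc command lines: extract media, then convert."""
    # Step 1: for docx input the images land in <media_parent>/media/.
    extract = ["pandoc", f"--extract-media={media_parent}", docx]
    # Step 2: the real conversion; the filter can now open the image files.
    convert = [
        "pandoc", docx,
        "--to", "markdown",
        "--filter", filter_script,
        "--output", markdown_out,
    ]
    return extract, convert


def convert_with_extracted_media(docx, markdown_out, filter_script):
    """Run both steps, stopping if either pandoc call fails."""
    extract, convert = build_commands(docx, markdown_out, filter_script)
    # The converted document from step 1 is not needed, so drop stdout.
    subprocess.run(extract, check=True, stdout=subprocess.DEVNULL)
    subprocess.run(convert, check=True)


# Placeholder file names, shown without actually invoking pandoc.
extract_cmd, convert_cmd = build_commands(
    "MyDocxFile.docx", "out.md", "my_filter.py"
)
print(" ".join(extract_cmd))
print(" ".join(convert_cmd))
```

Calling convert_with_extracted_media("MyDocxFile.docx", "out.md", "my_filter.py") would then run both steps in order.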
