5ilver4rrow

Reputation: 45

Import Images and zip .docx files with repository in Palantir Foundry

This is a follow-up question to the (working) solution given here: Output .docx document using repository in palantir foundry
(it generates Word files in a Foundry repository and writes the docx files into a Spark df)

I managed to write data from other data sources (dfs) into the document, but I could not get an image into the document using doc.add_picture(). In theory it should work: even though Foundry is mostly used for tabular data analysis, image processing is (afaik) also supported. However, I could not find a proper way to get the image as a repository input in a form that doc.add_picture() accepts.


More generally (I know that creating well-formatted docx files is not the fundamental idea of Foundry), the python-docx library has some weaknesses when it comes to formatting. So, on a meta level, one could think of unzipping a template docx file, which contains:

_rels
docProps
word
[Content_Types].xml

Now the question: if I put the unzipped docx file into Foundry and, for each row iteration (as in the previous question), only update document.xml (in the word folder) with the corresponding content from the rows of a given df, can Foundry then zip the unzipped docx folders/files with the updated XML into new docx files that are, as in the previous question, saved in a PySpark df?

Upvotes: 2

Views: 453

Answers (1)

fmsf

Reputation: 37177

Disclaimer: I have not tried this, but it sounds like it could work. Make sure your driver has enough memory to run this.

You could, for example, write each document into a tempfile (https://docs.python.org/3.6/library/tempfile.html), then read the binary back from it and use it in your zips:

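import tempfile

import docx

binaryDocs = {}

# `row` comes from the row iteration in your original function
with tempfile.TemporaryFile() as worddoc:
    doc = docx.Document()
    doc.add_heading(row['name'])
    doc.add_paragraph(row['content'])
    doc.save(worddoc)

    # Rewind first, otherwise read() starts at the end of the file and returns b''
    # https://docs.python.org/3.6/library/tempfile.html#examples
    worddoc.seek(0)
    binaryDocs[row['name']] = worddoc.read()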

Then use the same strategy to write a file to a dataset, but zip the contents first, for example with this snippet taken from "Python in-memory zip library". You'll need to adapt it and put it inside generate_files from your original function.

import io
import zipfile

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with transform_output.filesystem().open(filename, 'wb') as f:
    f.write(zip_buffer.getvalue())
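
For the template idea from the question (only swapping the content of word/document.xml), the same in-memory zip approach should work without unzipping anything to disk. Below is an untested-in-Foundry sketch using only the standard library. The names fill_template and make_stand_in_template, and the {{name}} placeholder convention, are my own assumptions for illustration; the stand-in template is not a valid Word file, it only mimics the archive layout. Note that Word sometimes splits text across several runs, so a placeholder typed in Word may not appear as one contiguous string in document.xml.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_bytes, replacements, member="word/document.xml"):
    """Copy every entry of a .docx (zip) archive into a new in-memory archive,
    replacing placeholder strings inside `member` along the way.

    `replacements` maps placeholder text to the value to insert. Values are
    XML-escaped so characters like & or < do not break the document XML.
    """
    out_buffer = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == member:
                text = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    text = text.replace(placeholder, escape(str(value)))
                data = text.encode("utf-8")
            dst.writestr(item.filename, data)
    # The archive is closed here, so the buffer holds a complete zip file
    return out_buffer.getvalue()


def make_stand_in_template():
    """Build a tiny zip that mimics a .docx layout (hypothetical test data,
    NOT a document Word can open)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr("word/document.xml", "<w:t>{{name}}</w:t>")
    return buf.getvalue()
```

In your case you would read the real template's bytes once, call fill_template(template_bytes, {"{{name}}": row['name']}) per row, and write the returned bytes with f.write(...) as above.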

Upvotes: 1
