Reputation: 45
This is basically a follow-up question based on the (working) solution given here:
Output .docx document using repository in palantir foundry
(generates word files in a foundry repository and writes docx files in a spark df)
What I did manage is to write data from other data sources (dfs) into the document; what I did not manage is to get an image into the document using doc.add_picture(). I think theoretically it should work, since even if Foundry is usually used for tabular data analysis, image processing is (afaik) also supported; but I could not find a proper way to get the image as a repository input that also fits the doc.add_picture() function.
So in general (I know that it is not the fundamental idea of Foundry to allow the creation of well-formatted docx files) the python-docx library has some weaknesses when it comes to formatting. So, looking on a meta level one could think of unzipping a template docx file that contains:
._rels
.docProps
.word
.[Content_Types]
Now the question would be, if I put the unzipped docx file into Foundry and for each row-iteration (like in the previous question) I only update the Document.xml (that is in the .word folder) with the according content from a given df (its rows), is Foundry then able to zip the unzipped docx folders/files with the updated xml into new docx files that are like in the previous question saved in a pyspark df?
Upvotes: 2
Views: 453
Reputation: 37177
Disclaimer: I have not tried this, but it sounds like it could work. Make sure your driver has enough memory to run this.
You could for example use a tempfile https://docs.python.org/3.6/library/tempfile.html to write your docs into, afterwards you can read the binary from it and use it into your zips:
import tempfile
binaryDocs = {}
with tempfile.TemporaryFile() as worddoc:
doc = docx.Document()
doc.add_heading(row['name'])
doc.add_paragraph(row['content'])
doc.save(worddoc)
# https://docs.python.org/3.6/library/tempfile.html#examples
binaryDocs[row['name']] = wordoc.read()
then use the same strategy to write a file to a dataset but using zip for example with this example stolen from Python in-memory zip library which you'll need to replace within the generate_files
from your original function..
import io
import zipfile
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
zip_file.writestr(file_name, data.getvalue())
with transform_output.filesystem().open(filename, 'wb') as f:
f.write(zip_buffer.getvalue())
Upvotes: 1