Asher

Reputation: 193

Is it possible to generate PDFs from datasets and save them to Foundry incrementally?

FPDF is a library that can turn a pandas dataframe into a nicely formatted PDF report. Is there a feature in Foundry Code Repositories or Code Workbook for writing PDF files into Foundry from a Spark or pandas dataframe?

I have a requirement to create a nicely formatted PDF report from a Foundry dataset filtered down to a few rows.

With the help of user https://stackoverflow.com/users/4922673/jackfischer I was able to get the requirement working. However, the code overwrites the existing file. How can I add new files to the dataset incrementally every time the code is run? I am using the Code Workbook templating feature to pass a parameter to the logic, and every time a new parameter is passed, the logic should create a new file.

Example:

  1. samplefile.txt
  2. samplefile2.txt

Upvotes: 3

Views: 1601

Answers (1)

jackfischer

Reputation: 156

While I'm not familiar with the FPDF library specifically, Foundry supports generating files from datasets in transforms or Code Workbooks.

To create a single Pandas-based PDF from your dataset, convert the dataset to Pandas and get an output file handle from Foundry. In Code Workbooks, this looks like:

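def pdf_dataset(input_df):
    output = Transforms.get_output()
    pandas_df = input_df.toPandas()  # named pandas_df so it does not shadow the usual `pd` alias
    output_fs = output.filesystem()
    # output_file_path is a placeholder: set it to the file name you want in the output dataset
    with output_fs.open(output_file_path, "wb") as f:
        # use FPDF as needed to render pandas_df, then write the resulting PDF bytes to f
        pass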
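
The question asks how to avoid overwriting the same file on each run. One option is to give every run its own file name. The following is a minimal sketch, assuming the templated parameter is available as a plain Python value. `build_output_path` and its `prefix` argument are hypothetical names for illustration, not part of Foundry or FPDF. Whether files from earlier runs are kept also depends on how the output dataset is written, which this sketch does not cover.

```python
import re
from datetime import datetime, timezone


def build_output_path(parameter, now=None, prefix="samplefile"):
    """Return a file name that differs per parameter value and per run (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Replace every run of characters that is unsafe in a file name with "_".
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", str(parameter)).strip("_")
    # A UTC timestamp keeps two runs with the same parameter from colliding.
    return "{}_{}_{}.pdf".format(prefix, safe, now.strftime("%Y%m%dT%H%M%SZ"))


# Fixed timestamp so the example is reproducible.
fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
example = build_output_path("region: EMEA", now=fixed)
```

The returned string could then be used as the path passed to `output_fs.open(...)`.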

Alternatively, you can create one PDF per row in parallel via Spark. The easiest way is to structure your data so that all the parameters needed to generate each PDF are colocated in a single row. You can then run a Python function on each row that generates the PDF and writes it from Python memory to the destination dataset.

In a Code Workbook this would resemble:

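def pdf_dataset(input_df):
    output = Transforms.get_output()

    def generate_pdf(row):
        output_fs = output.filesystem()
        # output_file_path is a placeholder: derive a distinct name per row so files do not overwrite each other
        with output_fs.open(output_file_path, "wb") as f:
            # use FPDF as needed to render this row, then write the resulting PDF bytes to f
            pass

    # run generate_pdf once per row on the Spark executors
    input_df.rdd.foreach(generate_pdf)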
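
To illustrate the one-file-per-row pattern outside Foundry, here is a small sketch that can be run locally. `InMemoryFileSystem` is a hypothetical stand-in for the object returned by `output.filesystem()`, `row_file_path` assumes each row has a unique `id` field, and the `render` callback returns placeholder bytes rather than a real PDF. None of these names come from Foundry or FPDF.

```python
import io
from contextlib import contextmanager


class InMemoryFileSystem:
    """Hypothetical stand-in for output.filesystem(); stores written bytes in a dict."""

    def __init__(self):
        self.files = {}

    @contextmanager
    def open(self, path, mode="wb"):
        buffer = io.BytesIO()
        try:
            yield buffer
        finally:
            # Capture whatever was written once the with-block exits.
            self.files[path] = buffer.getvalue()


def row_file_path(row, key_field="id"):
    # Assumption: key_field is unique per row, so every row gets its own file.
    return "report_{}.pdf".format(row[key_field])


def write_reports(rows, fs, render):
    # One file per row; render(row) must return the bytes to store.
    for row in rows:
        with fs.open(row_file_path(row), "wb") as f:
            f.write(render(row))


fs = InMemoryFileSystem()
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
# Placeholder renderer: real code would return PDF bytes produced by FPDF.
write_reports(rows, fs, lambda row: repr(row).encode("utf-8"))
```

Because the path is derived from the row itself, two rows never write to the same file. The same idea applies inside `generate_pdf` when choosing the path for each row.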

Upvotes: 3
