Why is this Apache Beam pipeline that reads an Excel file and creates a .csv from it not working?

Question

I am pretty new to Apache Beam and I am having a problem with this simple task: I am trying to create a new .csv file starting from an .xlsx Excel file. To do this I am using Apache Beam with Python 3 and the Pandas library. I admit that these are all pretty new topics to me.

I am working on Google Colab, but I don't think that is relevant to the problem.

I installed Apache Beam and Pandas like this (the ! is just how Google Colab runs a shell command):

!{'pip install --quiet apache-beam pandas'}

And this is my Python code implementing the Apache Beam pipeline:

import apache_beam as beam
import pandas as pd

def parse_excel(line):
  # Use the pandas library to parse the line into a DataFrame
  df = pd.read_excel(line)
  print("DATAFRAME")

  # Convert the DataFrame to a list of dictionaries, where each dictionary represents a row in the DataFrame
  # and has keys that are the column names and values that are the cell values
  return [row.to_dict() for _, row in df.iterrows()]

def print_json(json_object):
  # Print the JSON object
  print(json_object)

def run(argv=None):
  print("START run()")
  p = beam.Pipeline()

  # Read the Excel file as a PCollection
  lines = (
             p 
             | 'Read the Excel file' >> beam.io.ReadFromText('Pazienti_export_reduced.xlsx')
             | "Convert to pandas DataFrame" >> beam.Map(lambda x: pd.DataFrame(x))
             | "Write to CSV" >> beam.io.WriteToText(
                'data/csvOutput', file_name_suffix=".csv", header=True
            )
          )
  
  print("after lines pipeline")

  # Parse the lines using the pandas library
  #json_objects = lines | 'ParseExcel' >> beam.Map(parse_excel)

  # Print the values of the json_objects PCollection
  #json_objects | 'PrintJSON' >> beam.ParDo(print_json)
  

if __name__ == '__main__':
  print("START main()")
  print(beam.__version__)
  print(pd.__version__)
  run()

When I run it I get no error, but my data folder is still empty. It seems that the expected csvOutput.csv output file is not created at the end of my pipeline.

What is wrong? What am I missing? How can I try to fix my code?

Answers (1)
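
A few things in the posted code look like they would each prevent any output, so here is a hedged sketch rather than a definitive fix. First, `run()` builds the pipeline but never executes it: there is no `p.run()` call and no `with beam.Pipeline() as p:` block, so nothing is read or written and no error is raised. Second, `beam.io.ReadFromText` reads text line by line, while an .xlsx file is a zipped binary format, so the file needs to be parsed by pandas instead. Third, the `header` argument of `WriteToText` expects a string, not `True`. The helper names `row_to_csv_line` and `read_excel_rows` below are made up for illustration, the sketch assumes the first sheet holds the data and that `openpyxl` is installed so pandas can read .xlsx, and the Beam and pandas imports sit inside the functions only so the formatting helper can be tried without them.

```python
import csv
import io


def row_to_csv_line(values):
    """Format one row as a single CSV line, without a trailing newline."""
    buf = io.StringIO()
    # lineterminator="" because WriteToText appends the newline itself.
    csv.writer(buf, lineterminator="").writerow(values)
    return buf.getvalue()


def read_excel_rows(path):
    """Yield each row of the workbook's first sheet as a list of cell values."""
    import pandas as pd  # reading .xlsx also requires openpyxl

    df = pd.read_excel(path)
    for values in df.itertuples(index=False, name=None):
        yield list(values)


def run(input_path="Pazienti_export_reduced.xlsx", output_prefix="data/csvOutput"):
    import apache_beam as beam
    import pandas as pd

    # WriteToText takes the header as a string, so read only the column names first.
    header = row_to_csv_line(pd.read_excel(input_path, nrows=0).columns)

    # The with-block runs the pipeline on exit; without it (or p.run()) nothing happens.
    with beam.Pipeline() as p:
        (
            p
            | "File path" >> beam.Create([input_path])
            | "Read rows with pandas" >> beam.FlatMap(read_excel_rows)
            | "Format as CSV" >> beam.Map(row_to_csv_line)
            | "Write to CSV" >> beam.io.WriteToText(
                output_prefix,
                file_name_suffix=".csv",
                header=header,
                shard_name_template="",  # single file named data/csvOutput.csv
            )
        )
```

Calling `run()` from a Colab cell should then write `data/csvOutput.csv`. Without `shard_name_template=""`, Beam names the output with a shard suffix such as `csvOutput-00000-of-00001.csv`, which is easy to overlook when checking the folder. For a single small workbook, plain `pd.read_excel(...).to_csv(...)` would do the same job without Beam.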