In Azure databricks writing pyspark dataframe to eventhub is taking too long as there3 Million records in dataframe

Question

Oracle database table has 3 million records. I need to read it into dataframe and then convert it to json format and send it to eventhub for downstream systems.

Below is my pyspark code to connect and read oracle db table as dataframe

df = spark.read \
            .format("jdbc") \
            .option("url", databaseurl) \
            .option("query","select * from tablename") \
            .option("user", loginusername) \
            .option("password", password) \
            .option("driver", "oracle.jdbc.driver.OracleDriver") \
            .option("oracle.jdbc.timezoneAsRegion", "false") \
            .load()

then I am converting the column names and values of each row into json (placing under a new column named body) and then sending it to Eventhub.

I have defined ehconf and eventhub connection string. Below is my write to eventhub code

df.select("body") \
   .write\
   .format("eventhubs") \
   .options(**ehconf) \    
   .save()

my pyspark code is taking 8 hours to send 3 million records to eventhub.

Could you please suggest how to write pyspark dataframe to eventhub faster ?

My Eventhub is created under eventhub cluster which has 1 CU in capacity

Databricks cluster config : mode: Standard runtime: 10.3 worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5) driver type: Standard_D16as_v4 64GB Memory,16 cores

In Azure databricks writing pyspark dataframe to eventhub is taking too long as there3 Million records in dataframe

Answers (1)

Related Questions