Input: abc.tar.gz -> untar -> folder: abc
Folder structure of abc:
The root folder abc contains csv files generated from 100 cities, one file every 5 minutes throughout the day.
Number of csv files: 100 cities * 12 files per hour * 24 hours = 28,800 csv files
abc/
  city1_0005.csv
  city1_0010.csv
  ...
  city1_2355.csv
  city2_0005.csv
  city2_0010.csv
  ...
  city2_2355.csv
  ...
  city100_0005.csv
  city100_0010.csv
  ...
Technical requirement: read and process the files in parallel for better performance.
I have developed the code below, which processes the data sequentially, and I am looking for ways to optimize it.
staging_path = "abfss://xyz/abc"
# use Databricks utils to get the list of files in the folder
filesProp = dbutils.fs.ls(staging_path)
# extract the city names from the file names (city1_0005.csv -> city1)
citySet = set()
for file in filesProp:
    citySet.add(file.name.split('_')[0])
# dictionary to store the per-city dataframes
dictionary_df = {}
# read each city's data and insert it into a table
for cityName in citySet:
    # "city1_*" avoids also matching city10/city100 files
    filePath = staging_path + "/" + cityName + "_*"
    print(filePath)
    dictionary_df[cityName] = spark.read.options(header='True', delimiter=',').csv(filePath)
    dictionary_df[cityName].write.saveAsTable(cityName)
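One way to parallelize the loop above is to submit the per-city jobs from multiple driver threads, since Spark schedules jobs coming from separate threads concurrently. A minimal sketch (max_workers=8 is an assumption to tune for your cluster):

from concurrent.futures import ThreadPoolExecutor

def load_city(city):
    # each call runs an independent Spark job for one city's files
    filePath = staging_path + "/" + city + "_*"
    df = spark.read.options(header='True', delimiter=',').csv(filePath)
    df.write.saveAsTable(city)

# jobs submitted from separate threads run concurrently on the cluster
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(load_city, citySet))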
This is how I would solve this scenario:
Use a shell script to move the city-based csvs into city-specific folders. This ensures the files with the same schema sit under the same root folder (a sketch of this reshuffle follows the tree below):
/abc/
  city1/
    20211021/city1_0005
    20211021/city1_0010
    ...
  city2/
    20211021/city2_0005
    20211021/city2_0010
    ...
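A minimal sketch of that reshuffle, using dbutils from a notebook rather than a shell script since the files live in ADLS (the source path and the 20211021 date literal are assumptions):

source_dir = "abfss://xyz/abc"  # assumed flat landing folder
run_date = "20211021"           # hypothetical ingestion-date partition
for f in dbutils.fs.ls(source_dir):
    # skip anything that is not a raw csv (e.g. already-created city folders)
    if not f.name.endswith(".csv"):
        continue
    city = f.name.split("_")[0]
    # city1_0005.csv -> abfss://xyz/abc/city1/20211021/city1_0005.csv
    dbutils.fs.mv(f.path, f"{source_dir}/{city}/{run_date}/{f.name}")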
Since you are already on Azure Databricks, I would recommend the cloudFiles format (Auto Loader), which gives you better performance when scanning raw files in parallel from your data lake compared to the open-source structured streaming file source with the csv option.
Using structured streaming with foreachBatch() and trigger(once=True) processes only the files that arrived since the last execution; the details of already-processed files are maintained under the checkpoint_location path.
The process_multiple_csvs_different_schema function accepts a micro-batch, picks the columns belonging to each respective csv file, and writes them to the corresponding city table:
from pyspark.sql import functions as F

tmp_db = "test_multiple_csv_schema"
spark.sql(f"create database if not exists {tmp_db}")

base_path = "<your_base_mount_path_root_folder_for_csvs>"
checkpoint_location = f"{base_path}/checkpoint/multiplecsvs"
input_path = f"{base_path}/multiplecsvs/"
schema_location = f"{base_path}/schema/multiplecsvs"
staging_checkpoint_path = f"{base_path}/staging/checkpoint/multiplecsvs"
staging_data_path = f"{base_path}/staging/data/multiplecsvs"
input_format = "csv"
def process_multiple_csvs_different_schema(batch_df):
    # derive the table name from the file name:
    # .../city1/20211021/city1_0005.csv -> city1_0005
    df = (
        batch_df
        .withColumn("table", F.split(F.col("input_file_name"), r"\.csv")[0])
        .withColumn("table_path", F.split(F.col("table"), "/"))
        .withColumn("table_name", F.element_at(F.col("table_path"), -1))
        .drop("table", "table_path")
    )
    list_of_cities = [row[0] for row in df.select("table_name").distinct().collect()]
    for city in list_of_cities:
        print(f"processing data for {city}")
        city_df = df.where(f"table_name = '{city}'")
        # read one source file back to discover this city's column set
        input_file_name = city_df.limit(1).select("input_file_name").collect()[0][0]
        df_schema = spark.read.option("header", True).load(input_file_name, format=input_format)
        select_columns = df_schema.columns
        (
            city_df.select(select_columns)
            .withColumn("processed_time", F.current_timestamp())
            .write
            .option("mergeSchema", True)
            .mode("append")
            .format("delta")
            .saveAsTable(f"{tmp_db}.{city}")
        )
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", input_format)
    .option("cloudFiles.schemaLocation", schema_location)
    .load(input_path)
)

(
    raw_df.withColumn("input_file_name", F.input_file_name())
    .writeStream
    .option("checkpointLocation", staging_checkpoint_path)
    .option("mergeSchema", True)
    .format("delta")
    .outputMode("append")
    .trigger(once=True)
    .start(staging_data_path)
    .awaitTermination()
)
staging_df = spark.readStream.format("delta").load(staging_data_path)

(
    staging_df.writeStream
    .option("checkpointLocation", checkpoint_location)
    .trigger(once=True)
    .foreachBatch(lambda batch_df, batch_id: process_multiple_csvs_different_schema(batch_df))
    .start()
    .awaitTermination()
)
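Once the trigger-once run finishes, the result can be spot-checked from a notebook (a quick sketch; city1_0005 is a hypothetical table name following the table_name derivation above):

# list all tables created in the temporary database
spark.sql(f"show tables in {tmp_db}").show(truncate=False)
# spot-check one hypothetical table, including its processed_time column
spark.table(f"{tmp_db}.city1_0005").show(5)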