user3222101

Reputation: 1330

How to read multiple CSV files with different schemas in PySpark?

I have CSV files kept in subfolders of a given folder. Some of them use one column-naming format and some use another.

april_df = spark.read.option("header", True).option("inferSchema", True).csv('/mnt/range/2018_04_28_00_11_11/')

The command above picks up only one format and ignores the other. Is there a quick option for this, like mergeSchema for Parquet?

The format of some files is:

id, f_facing, l_facing, r_facing, remark

and of the others:

id, f_f, l_f, r_f, remark

There is also a chance that some columns will be missing in future files, so I need a robust way to handle this.

Upvotes: 2

Views: 3722

Answers (1)

Rob

Reputation: 478

There is no such option for CSV. Either the missing columns should be filled with null in the pipeline, or you will have to specify the schema before you import the files. If you have an idea of which columns might be missing in the future, you could pick a schema based on the length of df.columns, although that seems tedious.
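One way to sketch the "fill with null in the pipeline" approach: read each folder separately, rename the short column names to the canonical ones, then union with allowMissingColumns=True so absent columns become null. This assumes Spark 3.1+ (for that flag); the alias map and paths below are hypothetical examples, not part of the question.

```python
# Map each known alias to a canonical column name (hypothetical example
# based on the two formats shown in the question).
ALIASES = {
    "id": "id", "remark": "remark",
    "f_facing": "f_facing", "f_f": "f_facing",
    "l_facing": "l_facing", "l_f": "l_facing",
    "r_facing": "r_facing", "r_f": "r_facing",
}


def canonical_names(columns):
    """Strip stray whitespace and map known aliases to canonical names."""
    return [ALIASES.get(c.strip(), c.strip()) for c in columns]


def read_and_normalize(spark, path):
    """Read one CSV folder and rename its columns to the canonical set."""
    df = spark.read.option("header", True).option("inferSchema", True).csv(path)
    return df.toDF(*canonical_names(df.columns))


def read_all(spark, paths):
    """Union all folders; columns missing from a file come back as null."""
    from functools import reduce

    dfs = [read_and_normalize(spark, p) for p in paths]
    # Requires Spark 3.1+; fills absent columns with null instead of failing.
    return reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
```

Usage would look like `read_all(spark, ['/mnt/range/2018_04_28_00_11_11/', ...])`. A future unknown column name just passes through `canonical_names` unchanged, so nothing breaks; it simply shows up as null in the files that lack it.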

Upvotes: 1
