Reputation: 31
I am trying to import an unstructured CSV from Data Lake Storage into Databricks, and I want to read the entire content of this file:
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type A A
Date 1/1/2022 B
Time 0:00:00 A
X 1 m OK 1 2 3 B
Y - A
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type B C
Date 1/1/2022 D
Time 0:00:00 C
X 1 m OK 1 2 3 D
Y - C
1. Method 1: I tried reading the first line as a header:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load('abfss://xyz/sample.csv')
2. Method 2: I skipped reading the header (sketch below).
3. Method 3: I defined a custom schema (sketch below).
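Roughly, the last two attempts looked like this (a sketch: the path is shortened as in Method 1, and the schema fields are my reading of the eight columns listed in the file):

from pyspark.sql.types import StringType, StructField, StructType

# Method 2: no header, so Spark assigns default column names (_c0, _c1, ...)
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").load('abfss://xyz/sample.csv')

# Method 3: a custom schema with one string field per column in the file
schema = StructType([StructField(c, StringType(), True) for c in
                     ["Name", "Value", "Unit", "Status", "Nom", "Lower", "Upper", "Description"]])
df = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load('abfss://xyz/sample.csv')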
Upvotes: 0
Views: 184
Reputation: 780
If you know the schema ahead of time, it should be possible to read the CSV file and drop the malformed data.
See this as an example:
name_age.csv
Hello
name,age
aj,19
Hello
name,age
test,20
And the code to read this would be:
>>> from pyspark.sql.types import StringType, IntegerType, StructField, StructType
>>> schema = StructType([StructField("name", StringType(), True),
...                      StructField("age", IntegerType(), True)])
>>> df = spark.read.csv("name_age.csv", sep=",", mode="DROPMALFORMED", schema=schema)
>>> df.show()
+----+---+
|name|age|
+----+---+
| aj| 19|
|test| 20|
+----+---+
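Applied to the file in the question, the same idea might look like this (a sketch: it assumes the file is tab-separated, renames "Nom." to "Nom", and types the numeric columns as doubles so that only measurement rows like X parse cleanly):

>>> from pyspark.sql.types import StringType, DoubleType, StructField, StructType
>>> schema = StructType([StructField("Name", StringType(), True),
...                      StructField("Value", DoubleType(), True),
...                      StructField("Unit", StringType(), True),
...                      StructField("Status", StringType(), True),
...                      StructField("Nom", DoubleType(), True),
...                      StructField("Lower", DoubleType(), True),
...                      StructField("Upper", DoubleType(), True),
...                      StructField("Description", StringType(), True)])
>>> # The repeated "EdgeMaster" markers, the header lines, and the Type/Date/Time/Y
>>> # rows fail to parse against the double fields and get dropped
>>> df = spark.read.csv("abfss://xyz/sample.csv", sep="\t", mode="DROPMALFORMED", schema=schema)

Note that this keeps only the rows whose numeric columns parse, so the Type/Date/Time metadata is discarded; if you need those values you would have to read the file as plain text and parse the blocks yourself.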
Other helpful link: Remove first and last row from the text file in pyspark
Upvotes: 0