ASH

Reputation: 20302

Pyspark Not Picking Up Custom Schema

I'm testing this code.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)


customSchema = StructType([ \
StructField("id", StringType(), True), \
StructField("date", StringType(), True), \
etc., etc., etc.
StructField("filename", StringType(), True)])



fullPath = "path_and_credentials_here"
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema, delimiter='|').load(fullPath).withColumn("filename",input_file_name())

df.show()

My data is pipe-delimited, and the first row of each file holds some metadata, which is also pipe-delimited. The strange thing is that the custom schema is being ignored: instead of my column names, I get default names (_c0, _c1, ...), and the metadata row and the header row both show up as data. This is the output I get:

+------------------+----------+------------+---------+--------------------+
|               _c0|       _c1|         _c2|      _c3|            filename|
+------------------+----------+------------+---------+--------------------+
|                CP|  20190628|    22:41:58|   001586|   abfss://rawdat...|
|          asset_id|price_date|price_source|bid_value|   abfss://rawdat...|
|             2e58f|  20190628|         CPN|  108.375|   abfss://rawdat...|
|             2e58f|  20190628|         FNR|     null|   abfss://rawdat...|

etc., etc., etc.

How can I get the custom schema applied?

Upvotes: 0

Views: 1042

Answers (1)

Oliver W.

Reputation: 13459

The problem is that you're using the older (and no longer maintained) CSV reader, com.databricks.spark.csv. See the disclaimer note right under the title of that package.

If you use the built-in reader, spark.read.csv, and pass the schema to it, the schema is applied:

In [33]: !cat /tmp/data.csv
CP|12|12:13
a|b|c
10|12|13

In [34]: spark.read.csv(fullPath, header='false', schema = customSchema, sep='|').show()
+----+---+-----+
|name|foo|  bar|
+----+---+-----+
|  CP| 12|12:13|
|   a|  b|    c|
|  10| 12|   13|
+----+---+-----+

Upvotes: 1
