Reputation: 3565
I'm experimenting with the Spark-CSV package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames.
Everything works, but all columns are assumed to be of StringType.
As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), for built-in sources such as JSON the schema, including data types, can be inferred automatically.
Can the types of columns in a CSV file be inferred automatically?
Upvotes: 7
Views: 7359
Reputation: 91
Starting with Spark 2, we can use the inferSchema option like this:

getSparkSession().read().option("inferSchema", "true").csv("YOUR_CSV_PATH")
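
For context, here is a self-contained sketch of the same idea in Java. getSparkSession() above appears to be the poster's own helper; the app name and the people.csv path below are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvInferSchemaExample {
        public static void main(String[] args) {
            // Build or reuse a SparkSession, the Spark 2+ entry point
            SparkSession spark = SparkSession.builder()
                    .appName("csv-infer-schema")
                    .master("local[*]")
                    .getOrCreate();

            // header=true uses the first line for column names;
            // inferSchema=true makes Spark scan the data and pick column types
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("people.csv"); // hypothetical path

            df.printSchema(); // columns now come back as e.g. integer, double
            spark.stop();
        }
    }

Note that inferSchema requires one extra pass over the data to sample the types, so reading the file costs an additional scan.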
Upvotes: 7
Reputation: 2747
Unfortunately, this is not currently supported, but it would be a very useful feature. Currently, the types must be declared in DDL. From the documentation we have:
header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.
which is what you are seeing.
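
Until then, the workaround is to declare the schema yourself. A minimal sketch against the spark-csv 1.x API of that era; the year and model columns and the cars.csv path are hypothetical:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class ManualCsvSchema {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("manual-csv-schema")
                    .setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Declare column names and types up front, since spark-csv
            // otherwise reads every column as a string
            StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("year", DataTypes.IntegerType, true),
                DataTypes.createStructField("model", DataTypes.StringType, true)
            });

            DataFrame df = sqlContext.read()
                    .format("com.databricks.spark.csv")
                    .option("header", "true")
                    .schema(schema)
                    .load("cars.csv"); // hypothetical path

            df.printSchema(); // year: integer, model: string
            sc.stop();
        }
    }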
Note that it is possible to have types inferred at query time, e.g.:
select sum(mystringfield) from mytable
Upvotes: 3