Oleg Shirokikh

Reputation: 3565

Spark-csv data source: infer data types

I'm experimenting with the Spark-CSV package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames.

Everything works, but all columns are assumed to be of StringType.

As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), for built-in sources such as JSON, the schema with data types can be inferred automatically.

Can the types of columns in a CSV file be inferred automatically?

Upvotes: 7

Views: 7359

Answers (2)

Olga

Reputation: 91

Starting from Spark 2, you can use the inferSchema option like this:

getSparkSession().read().option("inferSchema", "true").csv("YOUR_CSV_PATH")
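For completeness, a minimal self-contained sketch of the same idea (assuming a local Spark 2.x session; the class name and the file data.csv are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InferSchemaExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-infer-schema")
                .master("local[*]")  // assumption: local run for illustration
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")       // use the first line for column names
                .option("inferSchema", "true")  // scan the data to guess column types
                .csv("data.csv");               // placeholder path

        df.printSchema();  // columns now show e.g. integer/double instead of string
        spark.stop();
    }
}

Note that schema inference requires an extra pass over the data, so it can be slow on large files.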

Upvotes: 7

dpeacock

Reputation: 2747

Unfortunately this is not currently supported, but it would be a very useful feature. Currently, the column types must be declared in DDL. From the documentation we have:

header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.

which is what you are seeing.

Note that it is possible to coerce types at query time, e.g.

SELECT SUM(mystringfield) FROM mytable
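To make that concrete, a minimal Spark 1.x sketch using spark-csv (the file cars.csv and the column mystringfield are hypothetical; SUM coerces the string column to a numeric type when the query runs):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class QueryTimeCoercionExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("csv-query-coercion").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // spark-csv loads every column as StringType.
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("cars.csv");  // hypothetical file

        df.registerTempTable("mytable");

        // SUM forces a numeric coercion of the string column at query time.
        sqlContext.sql("SELECT SUM(mystringfield) FROM mytable").show();

        sc.stop();
    }
}

Alternatively, individual columns can be cast by hand once the DataFrame is loaded, e.g. df.withColumn("year", col("year").cast("int")), where "year" stands in for whichever column you need typed.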

Upvotes: 3
