Inferring PySpark schema

Question

When a CSV file (or any other format) is read into a DataFrame with inferSchema set to true, are all the rows of a given column parsed to infer the schema, or just a sample of them? Example:

df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_path))

o_O · Accepted Answer

As per Spark documentation for inferSchema (default=false):

Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.

So by default every row is read: samplingRatio defaults to 1.0, which means the whole input is used for inference. Setting samplingRatio to a smaller value avoids going through all the data to infer the schema:

Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option.
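As a rough illustration of the option above, here is a minimal sketch that infers the schema from only a fraction of the rows. The helper name read_csv_sampled, the file name data.csv, and the 0.1 ratio are made up for the example; only the inferSchema, samplingRatio, header and sep options come from the Spark CSV reader.

```python
def read_csv_sampled(spark, path, sampling_ratio=0.1, header=True, sep=","):
    """Read a CSV file, inferring the schema from a fraction of the rows.

    `spark` is an existing SparkSession. `read_csv_sampled` is a
    hypothetical helper for this example, not part of PySpark.
    """
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return (
        spark.read.format("csv")
        .option("header", header)
        .option("sep", sep)
        .option("inferSchema", True)               # triggers the extra pass
        .option("samplingRatio", sampling_ratio)   # fraction of rows used for inference
        .load(path)
    )


def demo():
    # Requires PySpark; "data.csv" is a placeholder path (assumption).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("infer-schema-demo").getOrCreate()
    df = read_csv_sampled(spark, "data.csv", sampling_ratio=0.1)
    df.printSchema()
```

Note that a small sample can miss rare values (for example, a single non-numeric entry in an otherwise integer column), so the inferred types may be narrower than the full data requires.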
