Tarique

Reputation: 711

Inferring PySpark schema

When a CSV file (or any other format) is read into a DataFrame with inferSchema set to true, are all the rows of a particular column parsed to infer the schema, or just a sample of them? Example:

df = (
    spark.read.format(file_type)
    .option("inferSchema", infer_schema)
    .option("header", first_row_is_header)
    .option("sep", delimiter)
    .load(file_path)
)

Upvotes: 2

Views: 3349

Answers (2)

Sumanta

Reputation: 37

The "samplingRatio" option lets Spark infer the schema from a fraction of the rows instead of scanning all of them.

Syntax for using it while reading a CSV:

df = (
    spark.read
    .option("header", True)
    .option("samplingRatio", 0.001)
    .csv(csv_path, inferSchema=True)
)

Upvotes: 0

o_O

Reputation: 429

As per the Spark documentation for inferSchema (default=false):

Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.

So with the default settings every row is read during inference. We can lower the option samplingRatio (default=1.0) to avoid going through all the data when inferring the schema:

Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option.
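
To make the trade-off concrete, here is a toy sketch in plain Python of what sample-based inference means. It is not Spark's implementation: the helper names `infer_type` and `infer_with_sampling`, the three-type ladder, and the example column are all assumptions made up for illustration. It only shows that a small sample can miss a rare value that would otherwise force a wider type.

```python
import random


def infer_type(values):
    """Return the narrowest of 'int', 'double', 'string' that fits every value.

    Hypothetical helper for illustration only, not a Spark API.
    """
    if not values:
        # Nothing to look at, so fall back to the widest type.
        return "string"

    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


def infer_with_sampling(values, sampling_ratio, seed=0):
    """Infer a column type from a random fraction of the values."""
    rng = random.Random(seed)
    # random() returns a float in [0, 1), so a ratio of 1.0 keeps every row.
    sample = [v for v in values if rng.random() < sampling_ratio]
    return infer_type(sample)


# 9,999 integer-looking values plus a single non-numeric one at the end.
column = [str(i) for i in range(9999)] + ["n/a"]

full = infer_with_sampling(column, 1.0)       # every row is inspected
sampled = infer_with_sampling(column, 0.001)  # only about 10 rows are inspected

print("full scan:", full)
print("0.1% sample:", sampled)
```

The full scan sees the "n/a" row and settles on string. The 0.1% sample will most likely never see that row, so it can settle on a narrower type than the data really supports. That is the price of a low samplingRatio: a faster inference pass in exchange for a schema that reflects only the sampled rows.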

Upvotes: 3
