Reputation: 711
When a CSV file (or any other file) is read into a DataFrame with inferSchema set to true, are all the rows of a given column parsed to infer the schema, or just a sample of them? Example:
df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_path))
Upvotes: 2
Views: 3349
Reputation: 37
The "samplingRatio" option lets you avoid scanning every row during schema inference: Spark infers the schema from only the given fraction of rows.
Syntax for setting it while reading a CSV:
df = (
    spark.read
    .option("header", True)
    .option("samplingRatio", 0.001)
    .csv(csv_path, inferSchema=True)
)
Upvotes: 0
Reputation: 429
As per the Spark documentation for inferSchema (default=false):
"Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option."
So with inferSchema enabled and nothing else changed, every row is read during that extra pass, because samplingRatio defaults to 1.0. We can set the option samplingRatio (default=1.0) to a smaller value to avoid going through all the data when inferring the schema:
"Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option."
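To make the trade-off concrete, here is a toy model in plain Python of what inferring a type from a fraction of rows implies. It is only a sketch, not Spark's actual implementation: the helper names infer_type and infer_with_sampling are hypothetical, and the three-type lattice (int, double, string) is a simplifying assumption.

```python
import random


def infer_type(values):
    """Return the narrowest of 'int', 'double', 'string' that fits every value."""
    def fits(cast):
        for v in values:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


def infer_with_sampling(column, sampling_ratio=1.0, seed=0):
    """Infer a column type from a random fraction of its rows (toy model)."""
    rng = random.Random(seed)
    # Keep each row with probability sampling_ratio; a ratio of 1.0 keeps all rows.
    sample = [v for v in column if rng.random() < sampling_ratio]
    # Assumption for this sketch: fall back to the full column if the sample is empty.
    return infer_type(sample or column)


# 999 integer-looking rows followed by a single fractional value.
column = ["1"] * 999 + ["2.5"]

full_scan = infer_with_sampling(column, sampling_ratio=1.0)  # inspects every row
partial = infer_type(column[:10])  # a sample that happens to miss the last row

print(full_scan)  # double
print(partial)    # int
```

The point of the sketch is that a sample which misses the one unusual row yields a narrower type than a full scan would, so a small samplingRatio trades inference cost for the risk of a schema that does not fit every row.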
Upvotes: 3