Reputation: 2904
I am new to spark and was playing around with Pyspark.sql. According to the pyspark.sql documentation here, one can go about setting the Spark dataframe and schema like this:
spark= SparkSession.builder.getOrCreate()
from pyspark.sql.types import StringType, IntegerType,
StructType, StructField
rdd = sc.textFile('./some csv_to_play_around.csv'
schema = StructType([StructField('Name', StringType(), True),
StructField('DateTime', TimestampType(), True)
StructField('Age', IntegerType(), True)])
# create dataframe
df3 = sqlContext.createDataFrame(rdd, schema)
My question is, what does the True
stand for in the schema
list above? I can't seem to find it in the documentation. Thanks in advance
Upvotes: 32
Views: 72929
Reputation: 2181
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but they mention them in the docs. They're much more compact and readable than StructTypes
Upvotes: 7
Reputation: 6693
It means if the column allows null values, true
for nullable, and false
for not nullable
StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this fields can have null values.
Refer to Spark SQL and DataFrame Guide for more informations.
Upvotes: 29