HISI

Reputation: 4797

Should I set nullable to false or true?

I have a dataframe in Spark and I don't understand what the nullable property means. Should I set it to false or keep it true?

For example:

root
 |-- user_id: long (nullable = true)
 |-- event_id: long (nullable = true)
 |-- invited: integer (nullable = true)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = true)
 |-- friend_id: long (nullable = true)

Upvotes: 5

Views: 15613

Answers (1)

mahmoud mehdi

Reputation: 1590

Nullable indicates whether the column may contain null values. Setting it to false declares that the column is never null. If a null turns up anyway when the dataframe is built from rows, as in the example below, Spark throws a java.lang.RuntimeException during the first action on the dataframe.

Here's an example where the first row's value is null while the column's nullable property is set to false:
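To see which columns of a schema are declared non-nullable, you can read the nullable flag of each field. This is only a small sketch: the two-column schema is a hypothetical cut-down version of the one in the question, built by hand so that no SparkSession is needed.

```scala
import org.apache.spark.sql.types._

// Hypothetical schema mirroring two columns from the question.
val questionSchema = StructType(List(
  StructField("user_id", LongType, nullable = true),
  StructField("interested", IntegerType, nullable = false)
))

// Collect the names of the columns declared as non-nullable.
val required = questionSchema.fields.filterNot(_.nullable).map(_.name)
println(required.mkString(", ")) // prints "interested"
```

On an existing dataframe the same fields are available through df.schema.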

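import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Two rows; the first one has a null in the "num" position.
val data = Seq(
  Row(null, "a"),
  Row(5, "z")
)

// "num" is declared non-nullable, "letter" is nullable.
val schema = StructType(
  List(
    StructField("num", IntegerType, nullable = false),
    StructField("letter", StringType, nullable = true)
  )
)

// `spark` is assumed to be an existing SparkSession, e.g. the one spark-shell provides.
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

// First action on the dataframe: this is where the exception is thrown.
df.show()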

You'll then get the following exception, which says that the column num can't contain null values:

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'num' of input row cannot be null.
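There are two straightforward ways to avoid this exception: drop the offending rows before creating the dataframe, or declare the column as nullable. The sketch below shows both. The local SparkSession settings and the names strictDf and relaxedDf are my own choices for illustration; in spark-shell the existing spark session would be reused.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("nullable-demo")
  .getOrCreate()

val data = Seq(Row(null, "a"), Row(5, "z"))

val strictSchema = StructType(List(
  StructField("num", IntegerType, nullable = false),
  StructField("letter", StringType, nullable = true)
))

// Option 1: keep nullable = false and drop rows whose first field is null.
val nonNullRows = data.filter(row => !row.isNullAt(0))
val strictDf = spark.createDataFrame(
  spark.sparkContext.parallelize(nonNullRows),
  strictSchema
)
strictDf.show()

// Option 2: keep every row and declare all columns nullable instead.
val relaxedSchema = StructType(strictSchema.fields.map(_.copy(nullable = true)))
val relaxedDf = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relaxedSchema
)
relaxedDf.show()
```

Which option fits depends on whether a null in that column is bad data or a legitimate missing value.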

PS: nullable is true by default, so you only have to set it when you want it to be false.

https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala#L39
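The default is easy to check without a SparkSession, because StructField is a plain case class. A minimal sketch (the names defaulted and strict are just for illustration):

```scala
import org.apache.spark.sql.types._

// nullable is omitted here, so the default applies.
val defaulted = StructField("num", IntegerType)
// nullable is set explicitly.
val strict = StructField("num", IntegerType, nullable = false)

println(defaulted.nullable) // prints "true"
println(strict.nullable)    // prints "false"
```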

I hope it helps

Upvotes: 5
