Swapnil Chaudhari

Reputation: 1

What is recommended - keeping empty lists/arrays versus Null in spark tables?

I have a large Spark table with mixed column types: strings, arrays, and maps. The array and map columns are sparse. Should I store empty arrays in these columns, or make the values null? Similarly, is it better to store empty strings "" or null? What is good practice, and what are the advantages and disadvantages of each?

Upvotes: 0

Views: 858

Answers (1)

mpSchrader

Reputation: 932

Generally speaking, I would always try to use NULL values instead of empty strings or arrays. The main reason for me is how they are handled in Spark, e.g. when joining two data frames. In an equi-join, a NULL key never matches another key (not even another NULL), so those rows drop out of an inner join, whereas empty strings or lists all match each other. With sparse columns, that single "empty" key can collect a large share of the rows, which often results in heavily skewed data and can slow down your transformations considerably. Some information about skewed data can be found here [external link].

In addition, NULL values are skipped by many built-in functions, such as coalesce over columns [docs], count(col) in aggregations [related question], or first(col, ignorenulls=True) [docs]. Empty strings and lists, on the other hand, are ordinary values to these functions. If you want to use the functions as they are intended, I would also recommend using NULL over an empty string or list.

To sum up: using NULL rather than placeholder values like empty strings or lists lets you benefit from more native Spark functionality, so I would recommend using NULL whenever possible.

Upvotes: 1
