Reputation: 1
I have a large Spark table containing mixed data types: strings, arrays, and maps. The array and map columns are sparse in nature. Should I keep empty arrays as values for these columns, or make them null? Similarly, is it recommended to store empty strings "" or null? What is good practice, and what are the advantages and disadvantages of each?
Upvotes: 0
Views: 858
Reputation: 932
Generally speaking, I would always try to use NULL values instead of empty strings or arrays. The main reason for me is how they are handled in Spark, e.g. when joining two DataFrames: NULL values never match in a join condition, but empty strings or lists do. If many rows share the same empty value, they all match each other and end up in the same partition, which can result in very skewed data and heavily slow down your transformations. Some information about data skew can be found here [external link].
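To illustrate, here is a minimal sketch (the column names and data are made up) of how an inner join drops rows with NULL keys but matches empty-string keys against each other:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([("a", 1), ("", 2), (None, 3)], ["key", "left_val"])
right = spark.createDataFrame([("a", 10), ("", 20), (None, 30)], ["key", "right_val"])

# NULL never equals NULL in a join condition, so the None rows are dropped.
# The empty-string rows, however, match each other like any other value --
# with many such rows they all hash to the same partition and cause skew.
left.join(right, on="key", how="inner").show()
# Expected rows (order may differ):
#   key="a": left_val=1, right_val=10
#   key="":  left_val=2, right_val=20
```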
In addition, NULL values are treated specially by functions like coalesce of columns [docs], count in aggregations [related question], or first(col, ignorenulls=True) [docs], all of which skip NULLs but treat empty strings or lists as ordinary values. If you want to use these functions as they are intended, I would also recommend using NULL over an empty string/list.
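As a small example (again with made-up data), note how these functions skip NULLs but treat an empty string as a normal value:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("x", None), ("x", ""), ("x", "real")], ["grp", "val"])

# coalesce returns the first non-NULL value per row, so an empty string already
# "wins" and the fallback is only used for the NULL row.
df.select("val", F.coalesce(F.col("val"), F.lit("fallback")).alias("coalesced")).show()

df.groupBy("grp").agg(
    F.count("val").alias("cnt"),                      # 2 -- NULL is skipped, "" is counted
    F.first("val", ignorenulls=True).alias("first"),  # skips NULL, but may return "" rather than "real"
).show()
```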
To sum this up: using NULL over other values like empty strings or lists allows you to benefit from more of Spark's native functionality, so I would recommend using NULL whenever possible.
Upvotes: 1