Reputation: 758
I'm having trouble understanding the following phenomenon: in Spark 2.2, on Scala, I witness a significant increase in the persisted DataFrame size after replacing literal empty string values with lit(null).
This is the function I use to replace empty string values:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}

def nullifyEmptyStrings(df: DataFrame): DataFrame = {
  var in = df
  for (e <- df.columns) {
    // replace zero-length values in column e with a typed null, leave everything else untouched
    in = in.withColumn(e, when(length(col(e)) === 0, lit(null: String)).otherwise(col(e)))
  }
  in
}
I observe that the persisted (DISK_ONLY) size of my initial DataFrame is 1480 MB before running this function and 1610 MB afterwards. The number of partitions remains unchanged.
Any thoughts? The nulling itself works fine, by the way, but my main reason for introducing it was to reduce shuffle size, and it seems I have only increased it this way.
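For reference, the cached sizes can be read either from the Storage tab in the Spark UI or programmatically, roughly as in the sketch below (this assumes an active SparkSession named spark; getRDDStorageInfo is a developer API, so its exact shape may vary between Spark versions):

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.DISK_ONLY)
df.count()  // force the cache to be populated

// report name, partition count and size of every cached RDD (sizes are in bytes)
spark.sparkContext.getRDDStorageInfo
  .filter(_.numCachedPartitions > 0)
  .foreach { info =>
    println(s"${info.name}: ${info.numPartitions} partitions, " +
      s"${info.diskSize / (1024 * 1024)} MB on disk, ${info.memSize / (1024 * 1024)} MB in memory")
  }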
Upvotes: 3
Views: 2226
Reputation: 1
I see the same results here. It may also be worth checking the number of partitions, since huge partitions with many distinct values may store columns as raw strings rather than as a dictionary.
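For what it's worth, a quick way to look at that (a sketch, assuming df is the DataFrame in question):

import org.apache.spark.sql.functions.spark_partition_id

// partition count and the row distribution across partitions
println(df.rdd.getNumPartitions)
df.groupBy(spark_partition_id()).count().orderBy("count").show(false)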
Upvotes: -2
Reputation: 758
I'm going to answer this myself, as we have now done some investigation that might be useful to share.
Testing on large (tens of millions of rows) DataFrames consisting entirely of String columns, we observe that replacing empty Strings with nulls results in a slight decrease (1.1-1.5%) in the overall disk footprint when serialized to Parquet on S3.
However, DataFrames cached MEMORY_ONLY or DISK_ONLY were 6% and 8% larger, respectively. I can only speculate about how Spark internally represents the NULL value when the column is of StringType ... but whatever it is, it's bigger than an empty string. If there's any way to inspect this, I'll be glad to hear it.
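One crude way to compare the two cases head to head is the sketch below (it assumes an active SparkSession named spark and that nothing else is being cached at the same time, since maxBy(_.id) just grabs the most recently cached RDD; it only measures the size difference, not the internal encoding itself):

import org.apache.spark.sql.functions.lit
import org.apache.spark.storage.StorageLevel

val n = 10000000L
val empties = spark.range(n).withColumn("s", lit(""))
val nulls   = spark.range(n).withColumn("s", lit(null: String))

Seq("empty strings" -> empties, "nulls" -> nulls).foreach { case (label, d) =>
  val cached = d.persist(StorageLevel.MEMORY_ONLY)
  cached.count()                                              // materialise the cache
  val info = spark.sparkContext.getRDDStorageInfo.maxBy(_.id) // most recently cached RDD
  println(s"$label: ${info.memSize / (1024 * 1024)} MB in memory")
  cached.unpersist(blocking = true)
}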
The phenomenon is identical in PySpark and Scala.
Our goal in using nulls was to reduce shuffle size in a complex join action. Overall, we experienced the opposite. However, we'll keep using nulls because the automatic pushdown of isNotNull filters makes writing joins much cleaner in Spark SQL.
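To illustrate that last point, a minimal sketch (the orders/customers tables and the customer_id key are made up for the example; the isNotNull inference comes from Catalyst's constraint propagation, which is enabled by default in 2.2):

import org.apache.spark.sql.functions.col
import spark.implicits._

// hypothetical all-String tables where "" is the missing-value sentinel for customer_id
val ordersWithSentinels    = Seq(("c1", "10.0"), ("", "5.0")).toDF("customer_id", "amount")
val customersWithSentinels = Seq(("c1", "Alice"), ("", "unknown")).toDF("customer_id", "name")

// with the sentinel, the empty keys have to be excluded by hand,
// otherwise they all join against each other:
val joinedWithSentinels = ordersWithSentinels
  .filter(col("customer_id") =!= "")
  .join(customersWithSentinels.filter(col("customer_id") =!= ""), Seq("customer_id"))

// the same tables after nullifyEmptyStrings: an equi-join never matches null keys,
// and the optimizer adds IsNotNull filters to both sides on its own (visible via explain()):
val orders    = nullifyEmptyStrings(ordersWithSentinels)
val customers = nullifyEmptyStrings(customersWithSentinels)
val joinedWithNulls = orders.join(customers, Seq("customer_id"))
joinedWithNulls.explain()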
Upvotes: 3