Reputation: 1307
I have a CSV file and I am processing its data.
I am working with data frames, and I calculate average, min, max, mean, sum of each column based on some conditions. The data of each column could be empty or null.
I have noticed that in some cases I got as max, or sum a null value instead of a number. Or I got in max() a number which is less that the output that the min() returns.
I do not want to replace the null/empty values with other. The only thing I have done is to use these 2 options in CSV:
.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")
Is there any way to handle this issue? Have everyone faced this problem before? Is it a problem of data types?
I run something like this:
data.agg(mean("col_name"), stddev("col_name"),count("col_name"),
min("col_name"), max("col_name"))
Otherwise I can consider that it is a problem in my code.
Upvotes: 5
Views: 24442
Reputation: 5771
To add to other answers: Remember the null and NaN are different things to spark:
df_=spark.createDataFrame([(1, np.nan), (None, 2.0),(3,4.0)], ("a", "b"))
df_.show()
| a| b|
+----+---+
| 1|NaN|
|null|2.0|
| 3|4.0|
+----+---+
df_.agg(F.mean("a"),F.mean("b")).collect()
[Row(avg(a)=2.0, avg(b)=nan)]
Upvotes: 1
Reputation: 105
I have done some research on this question, and the result shows that mean, max, min functions ignore null values. Below is the experiment code and results. Environment: Scala, Spark 1.6.1 Hadoop 2.6.0
import org.apache.spark.sql.{Row}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
val row1 =Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))
val schema = StructType(StructField("key", StringType, false) ::
StructField("value", DoubleType, true) ::
StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value| d|
+---+-----+----------+
| 1| 2.4|2016-12-21|
| 1| null|2016-12-22|
| 2| null|2016-12-23|
| 2| null|2016-12-23|
| 3| 3.0|2016-12-22|
| 3| 2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
| avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667| 3.0| 2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
| 1| 2.4| 2.4| 2.4|
| 2| null| null| null|
| 3| 2.5| 3.0| 2.0|
+---+----------+----------+----------+
From the output you can see that the mean, max, min functions on column 'value' of group key='1' returns '2.4' instead of null which shows that the null values were ignored in these functions. However, if the column contains only null values then these functions will return null values.
Upvotes: 5
Reputation: 63221
Contrary to one of the comments it is not true that nulls are ignored. Here is an approach:
max(coalesce(col_name,Integer.MinValue))
min(coalesce(col_name,Integer.MaxValue))
This will still have an issue if there were only null values: you will need to convert Min/MaxValue to null or whatever you want to use to represent "no valid/non-null entries".
Upvotes: 2