Reputation: 572
I am running spark 2.1 on windows 10, I have fetched data from MySQL to spark using JDBC and the table looks like this
x y z
------------------
1 a d1
Null v ed
5 Null Null
7 s Null
Null bd Null
I want to create a new spark dataset with only x and y columns
from the above table and I wan't to keep only those rows which do not have null in either of those 2 columns. My resultant table should look like this
x y
--------
1 a
7 s
The following is the code:
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
val filter_DF = load_DF.select($"x".isNotNull,$"y".isNotNull).rdd
// lets print first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])
As shown, the above result doesn't give me actual values but it returns Boolean values (true when value is not Null and false when value is Null)
Upvotes: 1
Views: 4143
Reputation: 11
Apply "any" in na.drop:
df = df.select("x", "y")
.na.drop("any", Seq("x", "y"))
Upvotes: 1
Reputation: 15879
Spark provides DataFrameNaFunctions for this purpose of dropping null values, etc.
In your example above you just need to call the following on a DataSet that you load
val noNullValues = load_DF.na.drop("all", Seq("x", "y"))
This will drop records where nulls occur in either field x
or y
but not z
. You can read up on DataFrameNaFunctions
for further options to fill in data, or translate values if required.
Upvotes: 2
Reputation: 13154
You are simply applying a function (in this case isNotNull
) to the values when you do a select
- instead you need to replace select
with filter
.
val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)
or if you prefer:
val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)
Upvotes: 0
Reputation: 2052
Try this;
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
Now;
load_DF.select($"x",$"y").filter("x !== null").filter("y !== null")
Upvotes: 2