SoakingHummer

Reputation: 572

Spark: subset a few columns and remove null rows

I am running Spark 2.1 on Windows 10. I have fetched data from MySQL into Spark using JDBC, and the table looks like this:

x      y       z
------------------
1      a       d1
Null   v       ed
5      Null    Null
7      s       Null
Null   bd      Null

I want to create a new Spark dataset with only the x and y columns from the table above, and I want to keep only those rows which do not have a null in either of those two columns. My resulting table should look like this:

x      y
--------
1      a
7      s

The following is the code:

val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()
val filter_DF = load_DF.select($"x".isNotNull, $"y".isNotNull).rdd
// let's print the first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])

As shown, the result doesn't give me the actual values; it returns Boolean values instead (true when the value is not null, false when it is null).
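Printing the schema also shows Boolean columns rather than the original ones (output abbreviated; the generated column names may differ slightly by version):

load_DF.select($"x".isNotNull, $"y".isNotNull).printSchema()
// root
//  |-- (x IS NOT NULL): boolean (nullable = false)
//  |-- (y IS NOT NULL): boolean (nullable = false)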

Upvotes: 1

Views: 4143

Answers (4)

mohammad hussien

Reputation: 11

Apply "any" in na.drop:

val result = df.select("x", "y")
  .na.drop("any", Seq("x", "y"))
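The first argument controls how the drop works; a quick sketch for contrast (df standing in for the loaded DataFrame):

// "any": drop a row if any of the listed columns is null (what is wanted here)
val droppedAny = df.na.drop("any", Seq("x", "y"))

// "all": drop a row only when all of the listed columns are null
val droppedAll = df.na.drop("all", Seq("x", "y"))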

Upvotes: 1

Brad

Reputation: 15879

Spark provides DataFrameNaFunctions for exactly this purpose: dropping, filling, or replacing null values.

In your example above you just need to call the following on the Dataset that you load:

val noNullValues = load_DF.na.drop("any", Seq("x", "y"))

This drops records where a null occurs in either field x or y, but not z. Note that "any" is what you want here; "all" would only drop rows in which both x and y are null. You can read up on DataFrameNaFunctions for further options to fill in data, or translate values if required.
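For example, minimal sketches of those other options (the fill and replacement values here are made-up illustrations):

// fill nulls in x with 0 and nulls in y with the string "missing"
val filled = load_DF.na.fill(Map("x" -> 0, "y" -> "missing"))

// translate specific values in y, e.g. "a" -> "alpha"
val translated = load_DF.na.replace("y", Map("a" -> "alpha"))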

Upvotes: 2

Glennie Helles Sindholt

Reputation: 13154

You are simply applying a function (in this case isNotNull) to the values when you do a select, which is why you get the expression's Boolean result back; instead you need to replace the select with a filter.

val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)

or if you prefer:

val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)

Upvotes: 0

KOUSIK MANDAL

Reputation: 2052

Try this:

val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()

Now:

load_DF.select($"x",$"y").filter("x !== null").filter("y !== null")

Upvotes: 2
