questionasker

Reputation: 2697

Spark & Scala - Cannot Filter null Values from RDD

I tried to filter null values from an RDD, but it failed. Here's my code:

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

val raw_hbaseRDD = hBaseRDD.map {
  kv => kv._2
}

val Ratings = raw_hbaseRDD.map {
      result =>  val x = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("user")))
                 val y = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("item")))
                 val z = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("rating")))

                 (x,y, z)
    }
Ratings.filter(x => x._1 != null)

Ratings.foreach(println)

When debugging, null values still appeared after the filter:

(3359,1494,4)
(null,null,null)
(28574,1542,5)
(null,null,null)
(12062,1219,5)
(14068,1459,3)

Any better idea?

Upvotes: 3

Views: 11343

Answers (3)

V Jaiswal

Reputation: 51

Try the following:

Ratings.filter(x => x._1 != "")

Similar example here at Filter rdd lines by values in fields Scala
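A more defensive variant would exclude rows whose first field is either null or an empty string, since a cell can come back as null (column missing) or as an empty string (column written empty). A sketch, illustrated on a plain Scala `List` rather than a Spark RDD, as both share the same `filter` semantics:

```scala
// Sample rows shaped like the question's (user, item, rating) tuples.
val ratings = List(
  ("3359", "1494", "4"),
  (null, null, null),     // column missing, read back as null
  ("", "", ""),           // column present but written as an empty string
  ("28574", "1542", "5")
)

// Guard against both cases. && short-circuits, so user.nonEmpty
// is never called on a null reference.
val clean = ratings.filter { case (user, _, _) => user != null && user.nonEmpty }

println(clean) // List((3359,1494,4), (28574,1542,5))
```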

Upvotes: 0

Rakshith

Reputation: 664

Ratings.filter(x => x._1 != null)

This actually transforms the RDD, but you are not using the RDD it returns. You can try

Ratings.filter(_._1 != null).foreach(println)

Upvotes: 5

Tzach Zohar

Reputation: 37852

RDDs are immutable objects: a transformation on an RDD doesn't change the original RDD, but rather produces a new one. So you should use the RDD returned from filter (just like you do with the result of map) if you want to see its effect:

val result = Ratings.filter ( x => x._1 != null )
result.foreach(println)

Upvotes: 5
