Reputation: 553
I'd like to count the character 'a' in spark-shell. I have a somewhat clumsy method: split the text on 'a', and length - 1 is what I want. Here is the code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val test_data = sqlContext.read.json("music.json")
test_data.registerTempTable("test_data")
val temp1 = sqlContext.sql("select user.id_str as userid, text from test_data")
val temp2 = temp1.map(t => (t.getAs[String]("userid"), t.getAs[String]("text").split('a').length - 1))
However, someone told me this is not safe. I don't understand why. Can you explain, and give me a better way to do this?
Upvotes: 0
Views: 1626
Reputation: 330263
It is not safe because the value can be NULL:
val df = Seq((1, None), (2, Some("abcda"))).toDF("id", "text")
in which case getAs[String] will return null:
scala> df.first.getAs[String]("text") == null
res1: Boolean = true
and split will then throw a NullPointerException:
scala> df.first.getAs[String]("text").split("a")
java.lang.NullPointerException
...
This is most likely the situation you hit in your previous question.
One simple solution:
import org.apache.spark.sql.functions._

// replace everything except 'a', count what is left, and fall back to 0 for NULL
val aCnt = coalesce(length(regexp_replace($"text", "[^a]", "")), lit(0))
df.withColumn("a_cnt", aCnt).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// | 1| null| 0|
// | 2|abcda| 2|
// +---+-----+-----+
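The coalesce(..., lit(0)) part is what keeps the null row at 0; regexp_replace and length on their own just propagate NULL. A minimal sketch of what happens without it:

// without coalesce, NULL propagates through regexp_replace and length
df.withColumn("a_cnt", length(regexp_replace($"text", "[^a]", ""))).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// |  1| null| null|
// |  2|abcda|    2|
// +---+-----+-----+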
If you want to make your code relatively safe, you should either check for null explicitly:
def countAs1(s: String) = s match {
  case null  => 0
  case chars => chars.count(_ == 'a')
}
countAs1(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs1(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
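An equivalent variant, if you'd rather wrap the value in Option than match on null (countAs1b is just an illustrative name):

// Option(null) is None, so fold falls through to the default 0
def countAs1b(s: String): Int = Option(s).fold(0)(_.count(_ == 'a'))

countAs1b(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0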
or catch possible exceptions:
import scala.util.Try
def countAs2(s: String) = Try(s.count(_ == 'a')).toOption.getOrElse(0)
countAs2(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs2(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
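To plug this back into your original pipeline (a sketch assuming the same userid and text columns from your question), the map becomes:

// null text rows now count as 0 instead of throwing an NPE
val temp2 = temp1.map(t =>
  (t.getAs[String]("userid"), countAs2(t.getAs[String]("text"))))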
Upvotes: 2