YIWEN GONG

Reputation: 141

Spark Scala Count the Occurrence of Array of strings in the Map Key

Currently, I have a dataframe with two fields, named as follows:

id1              id2
Seq[String]      Map[String, (String, Long, Long)]

I would like to create another column named rate, which is the percentage of ids in id1 that appear as keys of the map.

It seems I was not able to fit a for loop inside a udf; I'm wondering how I should do this.

Upvotes: 1

Views: 681

Answers (1)

akuiper

Reputation: 214927

Use Seq.count with Map.isDefinedAt to count how many of the ids exist as keys of the Map, then simply wrap it in a udf:

import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq((Seq("a", "b", "c"), Map("a" -> ("x", 1L, 2L), "x" -> ("y", 2L, 2L)))).toDF("id1", "id2")

// Alias for the map type to keep the udf signature readable
type CustMap = Map[String, (String, Long, Long)]

// Fraction of ids in id1 that are defined as keys in id2
def percent_in = udf(
    (id1: Seq[String], id2: CustMap) => id1.count(id2.isDefinedAt) / id1.length.toDouble
)

df.withColumn("rate", percent_in($"id1", $"id2")).show
+---------+--------------------+------------------+
|      id1|                 id2|              rate|
+---------+--------------------+------------------+
|[a, b, c]|Map(a -> [x,1,2],...|0.3333333333333333|
+---------+--------------------+------------------+
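As a side note, on Spark 2.4+ a similar rate can be computed without a udf by combining the built-in map_keys, array_intersect and size functions. A minimal sketch, assuming the df defined above, spark.implicits._ in scope, and that the ids in id1 are distinct (array_intersect drops duplicates):

import org.apache.spark.sql.functions.{array_intersect, map_keys, size}

// Count how many ids in id1 also appear among the keys of id2,
// then divide by the total number of ids in id1 (division yields a double)
df.withColumn(
  "rate",
  size(array_intersect($"id1", map_keys($"id2"))) / size($"id1")
).show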

Upvotes: 1
