Lou_Ds

Reputation: 551

Scala Spark - Count occurrences of a specific string in Dataframe column

How can I count the occurrences of a String in a DataFrame column using Spark, partitioned by id?

e.g. Find the value "test" in column "name" of a df

In SQL this would be:

SELECT
  SUM(CASE WHEN name = 'test' THEN 1 ELSE 0 END) OVER window AS cnt_test
FROM
  mytable
WINDOW window AS (PARTITION BY id)

I've tried using map(v => v match { case "test" => 1 .. })

and things like:

def getCount(df: DataFrame): DataFrame = {
  df.withColumn("cnt_test", count(col("name") === lit("test")))
}

Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)?

thanks

Upvotes: 5

Views: 12159

Answers (2)

Raphael Roth

Reputation: 27373

If you want to translate your SQL directly, you can use window functions in Spark as well:

def getCount(df: DataFrame): DataFrame = {
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{sum, when}

  df.withColumn("cnt_test",
      sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
    )
}
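To see the difference from a groupBy-based aggregation, here is a small self-contained sketch (assuming the same sample data used in the other answer, run in a local SparkSession): the window version keeps one row per input row, with each row carrying its partition's count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

val spark = SparkSession.builder.master("local[1]").appName("cntTest").getOrCreate()
import spark.implicits._

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")

// One row per input row; cnt_test repeats the partition's total on each row
val result = df.withColumn("cnt_test",
  sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id")))

result.show()
```

Unlike groupBy, no rows are collapsed: "joe" keeps cnt_test 0, and both rows with id "b" get cnt_test 1.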

Upvotes: 0

akuiper

Reputation: 214927

You can use groupBy + agg in Spark. Here, when($"name" === "test", 1) transforms the name column to 1 if name == 'test' and null otherwise, and count counts the non-null values:

df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))

Example:

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

Or, closer to your SQL query:

df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+
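Since the question also asks about performing another action (sum, max, min, etc.) alongside the occurrence check, here is a hedged sketch (same sample data, local SparkSession) showing that several aggregates, conditional or not, can be combined in a single agg call and computed in one pass per group:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, max, when}

val spark = SparkSession.builder.master("local[1]").appName("multiAgg").getOrCreate()
import spark.implicits._

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")

// Several aggregates per group in one shuffle
val result = df.groupBy("id").agg(
  count(when($"name" === "test", 1)).as("cnt_test"), // occurrences of "test"
  max($"name").as("max_name")                        // any other action, e.g. max
)
```

This is generally cheaper than computing each aggregate in a separate groupBy, since the grouping (and its shuffle) happens only once.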

Upvotes: 8
