Yali Pollak

Reputation: 35

Spark Scala - Need to iterate over column in dataframe

I have the following dataframe:

+---+----------------+
|id |job_title       |
+---+----------------+
|1  |ceo             |
|2  |product manager |
|3  |surfer          |
+---+----------------+

I want to take a column from the dataframe and create another column called 'rank' that classifies each job title:

+---+----------------+-------+
|id |job_title       | rank  |
+---+----------------+-------+
|1  |ceo             |c-level|
|2  |product manager |manager|
|3  |surfer          |other  |
+---+----------------+-------+

--- UPDATED ---

What I have tried so far is:

def func (col: column) : Column = {
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")

when (col.contains(cLevel), "C-level")
.otherwise(when(col.contains(managerLevel),"manager").otherwise("other"))}

Currently I get this error:

type mismatch;
found   : Boolean
required: org.apache.spark.sql.Column

and I think I also have other problems in the code. Sorry, but I'm at a beginner level with Scala on Spark.

Upvotes: 0

Views: 3505

Answers (2)

Ramesh Maharjan

Reputation: 41987

You can use the built-in when/otherwise functions for that case:

import org.apache.spark.sql.functions._
def func = when(col("job_title").contains("chief") || col("job_title").contains("ceo"), "c-level")
  .otherwise(when(col("job_title").contains("manager"), "manager")
    .otherwise("other"))

and you can call the function using withColumn:

df.withColumn("rank", func).show(false)

which should give you

+---+---------------+-------+
|id |job_title      |rank   |
+---+---------------+-------+
|1  |ceo            |c-level|
|2  |product manager|manager|
|3  |surfer         |other  |
+---+---------------+-------+

I hope the answer is helpful

Updated

I see that you have updated your post with your attempt: you create lists of levels and want to validate the job titles against those lists. For that case you can write a udf function:

val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")

import org.apache.spark.sql.functions._
def rankUdf = udf((jobTitle: String) => jobTitle match {
  case x if(cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))) => "C-Level"
  case x if(managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_))) => "manager"
  case _ => "other"
})

df.withColumn("rank", rankUdf(col("job_title"))).show(false)

which should give you your desired output
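
The list-matching logic inside the udf can be tried in isolation as plain Scala, without Spark. This is just the body of rankUdf above, extracted as an ordinary function over the same cLevel/managerLevel lists:

```scala
// Pure classification logic, identical to the match inside rankUdf above
val cLevel = List("ceo", "cfo")
val managerLevel = List("manager", "team leader")

def rank(jobTitle: String): String = jobTitle match {
  // match if any level string contains the title, or the title contains a level string
  case x if cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_)) => "C-Level"
  case x if managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_)) => "manager"
  case _ => "other"
}

// rank("ceo") == "C-Level"; rank("product manager") == "manager"; rank("surfer") == "other"
```

Testing the logic this way first makes it easier to debug before wrapping it in a udf, since errors inside a udf surface as opaque task failures.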

Upvotes: 2

vaquar khan

Reputation: 11489

val df = sc.parallelize(Seq(
  (1, "ceo"),
  (2, "product manager"),
  (3, "surfer"),
  (4, "Vaquar khan")
)).toDF("id", "job_title")

df.show()

// option 1: rank by id using a window function
df.createOrReplaceTempView("user_details")

sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show

// option 2: join against a lookup table of job_title -> rank
val df1 = sc.parallelize(Seq(
  ("ceo", "c-level"),
  ("product manager", "manager"),
  ("surfer", "other"),
  ("Vaquar khan", "Problem solver")
)).toDF("job_title", "ranks")
df1.show()
df1.createOrReplaceTempView("user_rank")

sqlContext.sql("SELECT user_details.id, user_details.job_title, user_rank.ranks FROM user_rank JOIN user_details ON user_rank.job_title = user_details.job_title ORDER BY user_details.id").show

Results:

+---+---------------+
| id|      job_title|
+---+---------------+
|  1|            ceo|
|  2|product manager|
|  3|         surfer|
|  4|    Vaquar khan|
+---+---------------+

+---------------+----+
|      job_title|rank|
+---------------+----+
|            ceo|   1|
|product manager|   2|
|         surfer|   3|
|    Vaquar khan|   4|
+---------------+----+

+---------------+--------------+
|      job_title|         ranks|
+---------------+--------------+
|            ceo|       c-level|
|product manager|       manager|
|         surfer|         other|
|    Vaquar khan|Problem solver|
+---------------+--------------+

+---+---------------+--------------+
| id|      job_title|         ranks|
+---+---------------+--------------+
|  1|            ceo|       c-level|
|  2|product manager|       manager|
|  3|         surfer|         other|
|  4|    Vaquar khan|Problem solver|
+---+---------------+--------------+

df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
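
Note that the inner join above drops any job_title that does not appear in the lookup dataframe df1. The core idea is a keyed lookup; as a plain-Scala sketch of the same lookup with a default for unknown titles (the Map contents mirror df1, and the "other" fallback is an assumption for illustration, not part of the original answer):

```scala
// Keyed lookup mirroring what the join does per row.
// The "other" default is an assumption; the inner join in the answer
// would instead drop rows with no matching job_title.
val ranks = Map(
  "ceo"             -> "c-level",
  "product manager" -> "manager",
  "surfer"          -> "other",
  "Vaquar khan"     -> "Problem solver"
)

def lookupRank(jobTitle: String): String =
  ranks.getOrElse(jobTitle, "other")
```

In Spark terms, keeping unmatched titles would correspond to a LEFT JOIN from user_details with a default filled in for nulls.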

Upvotes: 0
