Gaaaa

Reputation: 125

PySpark: split and conditional statements

I am trying to create a column called "w" by splitting the values of column "x" and then applying conditions. If a value contains the "<" symbol, 0.1 should be subtracted from the number. If a value starts with "+", the "+" should simply be removed. For ranges such as "35 - 40", I keep the first number.

I have written the split, but I still need to write the conditions.

Thank you for your help :)


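from pyspark.sql.functions import col, split

# Assumes an active SparkSession named `spark`.
# Ranges are quoted so they stay strings instead of being evaluated as subtractions.
data = [["1", "Amit", "DU", "I", "<25"],
        ["2", "Mohit", "DU", "I", "<25"],
        ["3", "rohith", "BHU", "I", "35 - 40"],
        ["4", "sridevi", "LPU", "I", "30 - 35"],
        ["1", "sravan", "KLMP", "M", "25 - 30"],
        ["5", "gnanesh", "IIT", "M", "40 - 45"],
        ["5", "gnadesh", "KLM", "c", "+45"]]

columns = ['ID', 'NAME', 'college', 'metric', 'x']

dataframe = spark.createDataFrame(data, columns)

# My attempt so far: keep only the part before "-"
dataframe = dataframe.withColumn("x", split(col("x"), "-").getItem(0))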

The dataframe looks like this before the split:

+---+-------+-------+------+--------+
| ID|   NAME|college|metric|       x|
+---+-------+-------+------+--------+
|  1|   Amit|     DU|     I|     <25|
|  2|  Mohit|     DU|     I|     <25|
|  3| rohith|    BHU|     I| 35 - 40|
|  4|sridevi|    LPU|     I| 30 - 35|
|  1| sravan|   KLMP|     M| 25 - 30|
|  5|gnanesh|    IIT|     M| 40 - 45|
|  5|gnadesh|    KLM|     c|     +45|
+---+-------+-------+------+--------+

My output should look like this:

+---+-------+-------+------+--------+----+
| ID|   NAME|college|metric|       x|   w|
+---+-------+-------+------+--------+----+
|  1|   Amit|     DU|     I|     <25|24.9|
|  2|  Mohit|     DU|     I|     <25|24.9|
|  3| rohith|    BHU|     I| 35 - 40|  35|
|  4|sridevi|    LPU|     I| 30 - 35|  30|
|  1| sravan|   KLMP|     M| 25 - 30|  25|
|  5|gnanesh|    IIT|     M| 40 - 45|  40|
|  5|gnadesh|    KLM|     c|     +45|  45|
+---+-------+-------+------+--------+----+

Upvotes: 0

Views: 87

Answers (1)

Ronak Jain

Reputation: 3348

From what I understood, you have three conditions for the values in column x (let me know if this is not the case):

  • If the value is <X, the new column value is X - 0.1
  • If the value is X-Y, the new column value is X
  • If the value is +X, the new column value is X

Thus this should work:

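from pyspark.sql import functions as F

# df is the dataframe from the question, with the original strings still in column x
(df.withColumn("NewColumn",
          # "<25" -> 25 - 0.1
          F.when(F.col("x").contains('<'), F.split("x", "<").getItem(1) - 0.1)
           # "35 - 40" -> "35" (trim drops the space left before the dash)
           .when(F.col("x").contains('-'), F.trim(F.split("x", "-").getItem(0)))
           # "+45" -> "45" ("+" is escaped because split takes a regex)
           .when(F.col("x").contains("+"), F.split("x", "\\+").getItem(1)))
   .show())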
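
If you want to check the three rules without a Spark session, they can be sketched as a plain Python function. This is only an illustration: `lower_bound` is a hypothetical helper name (not part of PySpark), and it assumes every value looks like `<N`, `N - M`, or `+N`.

```python
from decimal import Decimal


def lower_bound(value: str) -> str:
    """Apply the three rules to one raw value of column x (hypothetical helper)."""
    text = value.strip()
    if text.startswith("<"):
        # "<25" -> 25 - 0.1; Decimal avoids binary floating-point noise
        return str(Decimal(text[1:].strip()) - Decimal("0.1"))
    if text.startswith("+"):
        # "+45" -> "45"
        return text[1:].strip()
    if "-" in text:
        # "35 - 40" -> "35" (first part of the range)
        return text.split("-", 1)[0].strip()
    # Anything else is returned unchanged (an assumption; the question does not cover it)
    return text


for raw in ["<25", "35 - 40", "+45"]:
    print(raw, "->", lower_bound(raw))
```

The order of the checks mirrors the `when` chain: the first matching rule wins, and values that match none of the patterns pass through untouched here, whereas the `when` chain would give null for them.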


Upvotes: 2
