Bing Qian

Reputation: 89

In PySpark, how do I add a new column based on a substring of an existing column?

I want to use an "ip" column in a DataFrame, containing IP address strings, to add a new column called "ipClass" based on the first part of the IP "aaa.bbb.ccc.ddd":

say, if aaa < 127, then "Class A"; if aaa == 127, then "Loopback". I'm omitting the remaining cases for brevity.
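To be clear about the rule I'm after, here it is sketched in plain Python (the function name is just for illustration; the goal is to do this with Spark columns):

```python
def ip_class(ip: str) -> str:
    # Take "aaa" from "aaa.bbb.ccc.ddd" and convert it to int
    first_octet = int(ip.split(".")[0])
    if first_octet < 127:
        return "Class A"
    if first_octet == 127:
        return "Loopback"
    return "Class B"  # omitting the remaining cases, as above

print(ip_class("10.0.0.1"))   # Class A
print(ip_class("127.0.0.1"))  # Loopback
```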

So originally I have this DataFrame:

+-----+---------------+
|index|      ip       |
+-----+---------------+
|    1| 10.xx.xx.xxx  |
|    2| 127.xxx.xxx.xx|
|    3| 100.xx.xxx.xx |
|    4| 9.xx.xxx.xx   |
+-----+---------------+

The new DataFrame I want is:

+-----+---------------+-----------+
|index|      ip       |   ipClass |
+-----+---------------+-----------+
|    1| 10.xx.xx.xxx  |  Class A  |
|    2| 127.xxx.xxx.xx|  Loopback |
|    3| 100.xx.xxx.xx |  Class A  |
|    4| 9.xx.xxx.xx   |  Class A  |
|    5| 129.xx.xxx.xx |  Class B  |
+-----+---------------+-----------+

I'm just wondering what the question-mark part below should look like:

    .withColumn("ipClass", when( _?_ < 127, "Class A").when( _?_ == 127, "Loopback"))

The _?_ part needs to take "aaa" from "aaa.bbb.ccc.ddd" and then convert it to an int.

Thank you in advance !

Upvotes: 1

Views: 571

Answers (2)

anky

Reputation: 75080

You can also split on ".", take the first element, cast it to integer, and compare:

import pyspark.sql.functions as F

# Split on a literal dot (raw string so the backslash escapes the regex dot),
# take the first octet, and cast it to integer
splitted_col = F.split("ip", r"\.")[0].cast("integer")
df.withColumn("ipClass", F.when(splitted_col < 127, "Class A")
                          .when(splitted_col == 127, "Loopback")
                          .when(splitted_col > 127, "Class B").otherwise("Test")).show()

+-----+--------------+--------+
|index|            ip| ipClass|
+-----+--------------+--------+
|    1|  10.xx.xx.xxx| Class A|
|    2|127.xxx.xxx.xx|Loopback|
|    3| 100.xx.xxx.xx| Class A|
|    4|   9.xx.xxx.xx| Class A|
|    5| 129.xx.xxx.xx| Class B|
+-----+--------------+--------+

Upvotes: 4

Rayan Ral

Reputation: 1859

There is a pyspark.sql.functions.substring(str, pos, len) function to take a substring of a string; you can then cast the result to int with cast:

from pyspark.sql.functions import col, substring, when
from pyspark.sql.types import IntegerType

df.withColumn("ipClass", when(substring(col("ip"), 0, 3).cast(IntegerType()) < 127, "Class A"))
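One caveat worth noting (my addition, not part of the answer): a fixed three-character prefix is only a valid integer when the first octet actually has three digits, whereas splitting on "." works for any octet width. In plain Python terms:

```python
# A fixed 3-character prefix breaks for short first octets:
ip = "9.xx.xxx.xx"
prefix = ip[:3]                 # "9.x" -- not a valid integer (would cast to null in Spark)
first_octet = ip.split(".")[0]  # "9"   -- robust regardless of octet width
print(prefix, first_octet)
```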

Upvotes: 1
