Reputation: 89
I want to use a column "ip" in a DataFrame, containing IP address strings, to add a new column "ipClass" based on the first octet of "aaa.bbb.ccc.ddd":
say, if aaa < 127, then "Class A"; if aaa == 127, then "Loopback". I'm omitting the remaining cases for brevity.
So originally I have this DataFrame:
+-----+---------------+
|index| ip |
+-----+---------------+
| 1| 10.xx.xx.xxx |
| 2| 127.xxx.xxx.xx|
| 3| 100.xx.xxx.xx |
| 4| 9.xx.xxx.xx |
|    5| 129.xx.xxx.xx |
+-----+---------------+
The new DataFrame I want to have is:
+-----+---------------+-----------+
|index| ip | ipClass |
+-----+---------------+-----------+
| 1| 10.xx.xx.xxx | Class A |
| 2| 127.xxx.xxx.xx| Loopback |
| 3| 100.xx.xxx.xx | Class A |
| 4| 9.xx.xxx.xx | Class A |
| 5| 129.xx.xxx.xx | Class B |
+-----+---------------+-----------+
Just wondering what the question-mark part below should look like?

.withColumn("ipClass", when( _?_ < 127, "Class A").when( _?_ == 127, "Loopback"))

The _?_ part needs to take "aaa" from "aaa.bbb.ccc.ddd" and convert it to an int.
Thank you in advance!
Upvotes: 1
Views: 571
Reputation: 75080
You can also split on ".", then take the first element, cast it to an integer, and compare:
import pyspark.sql.functions as F

# split on the literal dot (raw string so "\." is not a Python escape),
# take the first octet, and cast it to an integer
splitted_col = F.split("ip", r"\.")[0].cast("integer")

df.withColumn("ipClass", F.when(splitted_col < 127, "Class A")
                          .when(splitted_col == 127, "Loopback")
                          .when(splitted_col > 127, "Class B")
                          .otherwise("Test")).show()
+-----+--------------+--------+
|index| ip| ipClass|
+-----+--------------+--------+
| 1| 10.xx.xx.xxx| Class A|
| 2|127.xxx.xxx.xx|Loopback|
| 3| 100.xx.xxx.xx| Class A|
| 4| 9.xx.xxx.xx| Class A|
| 5| 129.xx.xxx.xx| Class B|
+-----+--------------+--------+
Upvotes: 4
Reputation: 1859
There is a pyspark.sql.functions.substring(str, pos, len) function to take a substring of a string, which you can then cast to int with cast:

from pyspark.sql.functions import col, substring, when
from pyspark.sql.types import IntegerType

# Note: substring positions are 1-based, and a fixed length of 3 only
# parses cleanly when the first octet has exactly 3 digits; for shorter
# octets like "9.xx..." the cast yields null.
df.withColumn("ipClass", when(substring(col("ip"), 1, 3).cast(IntegerType()) < 127, "Class A"))
Upvotes: 1