Xi12
Xi12

Reputation: 1213

How to use split in pyspark

I need to split sd_phonenumber based on ".", I need two columns :

sd_phone|          phone|sd_phonenumber|
+--------+---------------+--------------+
|       1| 44 7723 466844| 44.7723466844|
|        |   646-821-1444|    6468211444|
|        |   405-455-5941|    4054555941|
|       1|886-4-2359-5958| 88.6423595958|
|       1|+1 973-340-9100|  1.9733409100|
|        |   046-23 45 50|          null|
|        |   559-374-2385|    5593742385|
|        |  +507 395 3560|    5073953560|
|        |   650-727-4950|    6507274950|

For example : 44.7723466844

col1    col2  
44     7723466844 

So far tried:

 df_phone_cc=df_phone.with column("sd_phone", split(col("sd_phonenumber"), ".").getItem(0))
 df_phone_cc=df_phone.withColumn("sd_phone" ,split_col.getItem(1))

I am getting null for the entire sd_phone columns. Can someone please help me out?

Upvotes: 1

Views: 1007

Answers (2)

wwnde
wwnde

Reputation: 26676

In scripting, logic is as important as code. Here I prefix numbers that dont have a dot with 0. and store in a new column. I split the new column with . and save it in pysparks StructType column. I use inline to send each of the element in the struct column into its own column

Code

df_phone_cc=(
  #Where there is no dot, introduce 0. and store in col new_sd_phonenumber
  df_phone.withColumn("new_sd_phonenumber", when(col('sd_phonenumber').rlike('\.'),col('sd_phonenumber')).otherwise(concat_ws('.', F.lit('0'), col('sd_phonenumber'))))
  #Split col new_sd_phonenumber by dot and save as Struct column
  .withColumn("new_sd_phonenumber",F.array(F.struct(*[split(col("new_sd_phonenumber"), "\.")[i].alias(f"phone{i+1}")for i in range(2)])))
  #Select all the columns
  .selectExpr('sd_phone','phone','sd_phonenumber','inline(new_sd_phonenumber)')).show()

Output

+--------+---------------+--------------+------+----------+
|sd_phone|          phone|sd_phonenumber|phone1|    phone2|
+--------+---------------+--------------+------+----------+
|       1| 44 7723 466844| 44.7723466844|    44|7723466844|
|       1|   646-821-1444|    6468211444|     0|6468211444|
|       1|   405-455-5941|    4054555941|     0|4054555941|
|       1|886-4-2359-5958| 88.6423595958|    88|6423595958|
+--------+---------------+--------------+------+----------+

Upvotes: 1

Jonathan
Jonathan

Reputation: 2033

Since you have already got the item 0 when you first .withColumn() the sd_phone column, if you want to create the col1 and col2 respectively, you should store your array column first, that is:

df_phone_cc = df_phone.withColumn("new_col", split(col("sd_phonenumber"), "."))
df_phone_cc = df_phone_cc.withColumn("col1", col("new_col").getItem(0))
df_phone_cc = df_phone_cc.withColumn("col2", col("new_col").getItem(1))

Upvotes: 2

Related Questions