Reputation: 1213
I need to split sd_phonenumber based on ".", I need two columns :
sd_phone| phone|sd_phonenumber|
+--------+---------------+--------------+
| 1| 44 7723 466844| 44.7723466844|
| | 646-821-1444| 6468211444|
| | 405-455-5941| 4054555941|
| 1|886-4-2359-5958| 88.6423595958|
| 1|+1 973-340-9100| 1.9733409100|
| | 046-23 45 50| null|
| | 559-374-2385| 5593742385|
| | +507 395 3560| 5073953560|
| | 650-727-4950| 6507274950|
For example : 44.7723466844
col1 col2
44 7723466844
So far tried:
df_phone_cc=df_phone.with column("sd_phone", split(col("sd_phonenumber"), ".").getItem(0))
df_phone_cc=df_phone.withColumn("sd_phone" ,split_col.getItem(1))
I am getting null for the entire sd_phone columns. Can someone please help me out?
Upvotes: 1
Views: 1007
Reputation: 26676
In scripting, logic is as important as code. Here I prefix numbers that dont have a dot with 0.
and store in a new column. I split the new column with .
and save it in pysparks StructType column. I use inline to send each of the element in the struct column into its own column
Code
df_phone_cc=(
#Where there is no dot, introduce 0. and store in col new_sd_phonenumber
df_phone.withColumn("new_sd_phonenumber", when(col('sd_phonenumber').rlike('\.'),col('sd_phonenumber')).otherwise(concat_ws('.', F.lit('0'), col('sd_phonenumber'))))
#Split col new_sd_phonenumber by dot and save as Struct column
.withColumn("new_sd_phonenumber",F.array(F.struct(*[split(col("new_sd_phonenumber"), "\.")[i].alias(f"phone{i+1}")for i in range(2)])))
#Select all the columns
.selectExpr('sd_phone','phone','sd_phonenumber','inline(new_sd_phonenumber)')).show()
Output
+--------+---------------+--------------+------+----------+
|sd_phone| phone|sd_phonenumber|phone1| phone2|
+--------+---------------+--------------+------+----------+
| 1| 44 7723 466844| 44.7723466844| 44|7723466844|
| 1| 646-821-1444| 6468211444| 0|6468211444|
| 1| 405-455-5941| 4054555941| 0|4054555941|
| 1|886-4-2359-5958| 88.6423595958| 88|6423595958|
+--------+---------------+--------------+------+----------+
Upvotes: 1
Reputation: 2033
Since you have already got the item 0 when you first .withColumn()
the sd_phone
column, if you want to create the col1 and col2 respectively, you should store your array column first, that is:
df_phone_cc = df_phone.withColumn("new_col", split(col("sd_phonenumber"), "."))
df_phone_cc = df_phone_cc.withColumn("col1", col("new_col").getItem(0))
df_phone_cc = df_phone_cc.withColumn("col2", col("new_col").getItem(1))
Upvotes: 2