Reputation: 883
I have a code for example C78907. I want to split it:
C78 # level 1
C789 # Level2
C7890 # Level 3
C78907 # Level 4
So far what I m using:
Df3 = Df2.withColumn('Level_One', concat(Df2.code.substr(1, 3)))
Df4 = Df3.withColumn('Level_two', concat(Df3.code.substr(1, 4)))
Df5 = Df4.withColumn('Level_theree', concat(Df4.code.substr(1, 5)))
Df6 = Df5.withColumn('Level_four', concat(Df5.code.substr(1, 6)))
The issue is when looking to the results, the codes of level four ( supposed to be 6 components) might contain codes from level one or two or three.
721 7213 7213 7213
758 7580 7580 7580
724 7242 7242 7242
737 7373 73730 73730
789 7895 78959 78959
V06 V061 V061 V061
381 3810 38100 38100
Ideally a restriction could be useful. I mean:
null
instead of imputing with the previous one.The desired output :
Initial_code level1 level2 level3 level4
7213 721 7213 null null
7580 758 7580 null null
7242 724 7242 null null
73730 737 7373 73730 null
38100D 381 3810 38100 38100D
Upvotes: 1
Views: 7850
Reputation: 43494
You can achieve your desired output by using pyspark.sql.Column.when()
and pyspark.sql.functions.length()
. When creating the column, check if the substring will have the correct length. If it does not, set the column to None
using pyspark.sql.functions.lit()
.
For example:
import pyspark.sql.functions as f
df.withColumn('Level_One', f.when(
f.length(f.col('code').substr(1, 3)) == 3,
f.col('code').substr(1, 3)
).otherwise(f.lit(None)))\
.withColumn('Level_Two', f.when(
f.length(f.col('code').substr(1, 4)) == 4,
f.col('code').substr(1, 4)
).otherwise(f.lit(None)))\
.withColumn('Level_Three', f.when(
f.length(f.col('code').substr(1, 5)) == 5,
f.col('code').substr(1, 5)
).otherwise(f.lit(None)))\
.withColumn('Level_Four', f.when(
f.length(f.col('code').substr(1, 6)) == 6,
f.col('code').substr(1, 6)
).otherwise(f.lit(None)))\
.show()
Output:
+------+---------+---------+-----------+----------+
| Code|Level_One|Level_Two|Level_Three|Level_Four|
+------+---------+---------+-----------+----------+
| 7213| 721| 7213| null| null|
| 7580| 758| 7580| null| null|
| 7242| 724| 7242| null| null|
| 73730| 737| 7373| 73730| null|
|38100D| 381| 3810| 38100| 38100D|
+------+---------+---------+-----------+----------+
Upvotes: 4