Reputation: 11
values=[("3","100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU"),("4","100;CalendarDay;g;440;Term;Degram")]
df=spark.createDataFrame(values,['id','derivate'])
I want to change the comma to pipe inside a column only if the comma is before an integer pyspark
input
|id |derivate |
+---+--------------------------------------------+
|3 |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU |
|4 |100;CalendarDay;g;440;Term;Degram |
+---+--------------------------------------------+
expected output
|id |derivate |ITEMS |
+---+--------------------------------------------+--------------------------------------------+
|3 |100;PerMonth;BBL;1500;Term;TkU,2500;Term;TEU|100;PerMonth;BBL|1500;Term;TkU|2500;Term;TEU|
|4 |100;CalendarDay;g;440;Term;Degram |100;CalendarDay;g|440;Term;Degram |
Upvotes: 1
Views: 337
Reputation: 32660
You can use regexp_replace
function with this regex [;,](?=\d+)
to match all commas and semi-colons that are followed by digit:
from pyspark.sql import functions as F
df.withColumn(
"ITEMS",
F.regexp_replace(F.col("derivate"), "[;,](?=\\d+)", "|")
).show(truncate=False)
#+---+---------------------------------------+---------------------------------------+
#|id |derivate |ITEMS |
#+---+---------------------------------------+---------------------------------------+
#|3 |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU|100;PerMonth;BB|1500;Tm;TkU|2500;Trm;TU|
#|4 |100;CalendarDay;g;440;Term;Degram |100;CalendarDay;g|440;Term;Degram |
#+---+---------------------------------------+---------------------------------------+
Upvotes: 1