gowtham natrajan
gowtham natrajan

Reputation: 11

Replace comma only if the followed by integer in pyspark column

values=[("3","100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU"),("4","100;CalendarDay;g;440;Term;Degram")]
df=spark.createDataFrame(values,['id','derivate'])

I want to change the comma to pipe inside a column only if the comma is before an integer pyspark

input

|id  |derivate                                   |
+---+--------------------------------------------+
|3  |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU     |
|4  |100;CalendarDay;g;440;Term;Degram           |
+---+--------------------------------------------+

expected output

|id |derivate                                    |ITEMS                                       |
+---+--------------------------------------------+--------------------------------------------+
|3  |100;PerMonth;BBL;1500;Term;TkU,2500;Term;TEU|100;PerMonth;BBL|1500;Term;TkU|2500;Term;TEU|
|4  |100;CalendarDay;g;440;Term;Degram           |100;CalendarDay;g|440;Term;Degram           |

Upvotes: 1

Views: 337

Answers (1)

blackbishop
blackbishop

Reputation: 32660

You can use regexp_replace function with this regex [;,](?=\d+) to match all commas and semi-colons that are followed by digit:

from pyspark.sql import functions as F

df.withColumn(
    "ITEMS",
    F.regexp_replace(F.col("derivate"), "[;,](?=\\d+)", "|")
).show(truncate=False)

#+---+---------------------------------------+---------------------------------------+
#|id |derivate                               |ITEMS                                  |
#+---+---------------------------------------+---------------------------------------+
#|3  |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU|100;PerMonth;BB|1500;Tm;TkU|2500;Trm;TU|
#|4  |100;CalendarDay;g;440;Term;Degram      |100;CalendarDay;g|440;Term;Degram      |
#+---+---------------------------------------+---------------------------------------+

Upvotes: 1

Related Questions