Alan

Reputation: 469

Replace a substring of a string in pyspark dataframe

How can I replace substrings of a string? For example, I created a data frame based on the following JSON format.

line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}

I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:

line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}

Please note that this is just an example: the real task is substring replacement, not single-character replacement.

Any guidance either in Scala or Pyspark is helpful.

Upvotes: 2

Views: 15165

Answers (3)

Alan

Reputation: 469

This is the way I solved it in PySpark:

from collections import OrderedDict
from pyspark.sql.functions import udf, col

def _name_replacement(val, ordered_mapping):
    # Apply the replacements in the mapping's iteration order.
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ...}
# Sort the keys numerically in descending order so that e.g. "25:" is
# replaced before "5:", which is a substring of it.
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))


Upvotes: 1

jwvh

Reputation: 51271

Let's say you have a collection of strings for possible modification (simplified for this example).

val data = Seq("1:0.01"
              ,"3:0.03,4:0.04"
              ,"2:0.01,3:0.02"
              ,"5:0.02")

And you have a dictionary of required conversions.

val num2name = Map("1" -> "A"
                  ,"2" -> "Bo"
                  ,"3" -> "Cy"
                  ,"4" -> "Dee")

From here you can use replaceSomeIn() to make the substitutions.

data.map("(\\d+):".r  //note: Map key is only part of the match pattern
                  .replaceSomeIn(_, m => num2name.get(m group 1)  //get replacement
                                                 .map(_ + ":")))  //restore ":"
//res0: Seq[String] = List(A:0.01
//                        ,Cy:0.03,Dee:0.04
//                        ,Bo:0.01,Cy:0.02
//                        ,5:0.02)

As you can see, "5:" matches the regex pattern, but since "5" has no entry in num2name, the string is left unchanged.
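The same dictionary-driven substitution can be sketched in plain Python with re.sub and a replacement function (reusing the num2name values above; the function name rename is mine):

```python
import re

num2name = {"1": "A", "2": "Bo", "3": "Cy", "4": "Dee"}

def rename(s):
    # Replace each "<digits>:" token whose digits appear in num2name;
    # leave unknown numbers (e.g. "5:") untouched.
    return re.sub(r"(\d+):",
                  lambda m: num2name.get(m.group(1), m.group(1)) + ":",
                  s)

print(rename("3:0.03,4:0.04"))  # Cy:0.03,Dee:0.04
```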

Upvotes: 1

P. Phalak

Reputation: 497

from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))

Details here: Pyspark replace strings in Spark dataframe column
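One caveat (my note, not from the linked answer): a bare '1:' pattern also matches the tail of '21:'. A lookbehind such as (?<!\d) guards against that; the sketch below uses Python's re, whose lookbehind syntax regexp_replace's Java regexes also accept:

```python
import re

# Naive pattern: "1:" also matches inside "21:".
naive = re.sub("1:", "a:", "21:0.5")            # "2a:0.5" -- corrupted

# Guarded pattern: require that no digit precedes the "1".
guarded = re.sub(r"(?<!\d)1:", "a:", "21:0.5")  # "21:0.5" -- unchanged
fixed = re.sub(r"(?<!\d)1:", "a:", "1:0.5")     # "a:0.5"
```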

Upvotes: 3
