Reputation: 469
How can I replace substrings of a string? For example, I created a DataFrame based on the following JSON format.
line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}
I need to replace the substrings "1:", "2:", "3:", and so on with "a:", "b:", "c:", etc. So the result will be:
line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}
Please note that this is just an example; the real task is substring replacement, not character replacement.
Any guidance either in Scala or Pyspark is helpful.
Upvotes: 2
Views: 15165
Reputation: 469
This is the way I solved it in PySpark:
from collections import OrderedDict
from pyspark.sql.functions import udf, col

def _name_replacement(val, ordered_mapping):
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:": "aaa:", "2:": "bbb:", ..., "24:": "xxx:", "25:": "yyy:", ....}

# Sort the keys by their numeric value, descending, so that longer
# prefixes such as "25:" are replaced before "2:" or "5:".
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))

replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))
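The ordering matters here. Below is a minimal standalone sketch (plain Python, no Spark, with a hypothetical three-entry mapping) showing why the keys are sorted in descending numeric order: replacing "5:" before "25:" would corrupt "25:".

```python
from collections import OrderedDict

def name_replacement(val, ordered_mapping):
    # Apply each substring replacement in order.
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

# Hypothetical mapping; descending numeric order ensures "25:" is
# handled before "5:" and "2:".
mapping = {"2:": "b:", "5:": "e:", "25:": "y:"}
ordered_mapping = OrderedDict(
    reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1])))
)

print(name_replacement("25:0.10,5:0.02,2:0.01", ordered_mapping))
# y:0.10,e:0.02,b:0.01
```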
Upvotes: 1
Reputation: 51271
Let's say you have a collection of strings for possible modification (simplified for this example).
val data = Seq("1:0.01"
,"3:0.03,4:0.04"
,"2:0.01,3:0.02"
,"5:0.02")
And you have a dictionary of required conversions.
val num2name = Map("1" -> "A"
,"2" -> "Bo"
,"3" -> "Cy"
,"4" -> "Dee")
From here you can use replaceSomeIn() to make the substitutions.
data.map("(\\d+):".r //note: Map key is only part of the match pattern
.replaceSomeIn(_, m => num2name.get(m group 1) //get replacement
.map(_ + ":"))) //restore ":"
//res0: Seq[String] = List(A:0.01
// ,Cy:0.03,Dee:0.04
// ,Bo:0.01,Cy:0.02
// ,5:0.02)
As you can see, "5:" is a match for the regex pattern, but since the 5 part is not defined in num2name, the string is left unchanged.
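Since the question also asks about PySpark, the same dictionary-guarded regex substitution can be sketched in plain Python with `re.sub` (the names below mirror the Scala snippet; this is an illustration, not the answer's own code):

```python
import re

num2name = {"1": "A", "2": "Bo", "3": "Cy", "4": "Dee"}

def rename(s):
    # Replace "<num>:" only when <num> is in the dictionary;
    # unknown numbers (e.g. "5:") are left untouched.
    return re.sub(r"(\d+):",
                  lambda m: num2name.get(m.group(1), m.group(1)) + ":",
                  s)

data = ["1:0.01", "3:0.03,4:0.04", "2:0.01,3:0.02", "5:0.02"]
print([rename(s) for s in data])
# ['A:0.01', 'Cy:0.03,Dee:0.04', 'Bo:0.01,Cy:0.02', '5:0.02']
```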
Upvotes: 1
Reputation: 497
from pyspark.sql.functions import *
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))
Details here: Pyspark replace strings in Spark dataframe column
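A single regexp_replace call handles one substring; to cover many, the calls can be chained. The chaining logic can be sketched in plain Python with a reduce over a mapping (the mapping below is hypothetical, and the same pattern applies to chained regexp_replace calls on the column):

```python
import re
from functools import reduce

# Hypothetical mapping from the question's example.
replacements = {"1:": "a:", "2:": "b:", "3:": "c:"}

def apply_all(s):
    # Chain one substitution per mapping entry, mirroring how several
    # regexp_replace calls can be chained on a DataFrame column.
    return reduce(lambda acc, kv: re.sub(re.escape(kv[0]), kv[1], acc),
                  replacements.items(), s)

print(apply_all("1:0.01,3:0.03"))
# a:0.01,c:0.03
```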
Upvotes: 3