Reputation: 709
I have an email column in a dataframe and I want to replace part of it with asterisks. I am unable to figure it out using PySpark functions.
My email column could be something like this"
email_col
[email protected]
[email protected]
What I want to achieve is this:
mod_email_col
ab**[email protected]
12*****[email protected]
So essentially apart from the 1st 2 characters and the last 2 characters, I want the remaining part to be replaced by asterisks.
This is what I tried
from pyspark.sql import functions as F
split_email = F.split(df.email_address, "@")
df = df.withColumn('email_part', split_email.getItem(0))
df = df.withColumn('start', df.email_part.substr(0,2))
df = df.withColumn('end', df.email_part.substr(-2,2))
df.withColumn(
'masked_part',
F.expr("regexp_replace(email_part, email_part[email_part.index(start)+len(start):email_part.index(end)], '*')")
).show(n=5)
Upvotes: 1
Views: 5576
Reputation: 13998
Your problem can be simplified using some string manipulations(Spark SQL functions: instr, concat, left, repeat, substr):
First find the position of the @
in the email string: pos_at = instr(email_col, '@')
, then the length of the username part is pos_at - 1
. If we take N=2
as the number of chars to be kept, then the number of chars to be masked should be pos_at - 1 - 2*N
, in the code, we have:
from pyspark.sql.functions import instr, expr
df = spark.createDataFrame(
[(e,) for e in ['[email protected]', '[email protected]', '[email protected]']]
, ['email_col']
)
# set N=2 as a parameter in the SQL expression
N = 2
df.withColumn('pos_at', instr('email_col', '@')) \
.withColumn('new_col', expr("""
CONCAT(LEFT(email_col,{0}), REPEAT('*', pos_at-1-2*{0}), SUBSTR(email_col, pos_at-{0}))
""".format(N))).show(truncate=False)
#+-------------------+------+-------------------+
#|email_col |pos_at|new_col |
#+-------------------+------+-------------------+
#|[email protected] |7 |ab**[email protected] |
#|[email protected]|10 |12*****[email protected]|
#|[email protected] |4 |[email protected] |
#+-------------------+------+-------------------+
Notice the issue with the last row when pos_at - 1 <= 2*N
, which has to be processed separately. If I define the following logic:
if `pos_at - 1 <= 2*N`: keep the first char and mask the rest
otherwise: keep the original processing routine
the whole processing can be wrapped up in a lambda function with two arguments (column_name
and N
)
# in the SQL expression, {0} is column_name and {1} is N
mask_email = lambda col_name, N: expr("""
IF(INSTR({0}, '@') <= {1}*2+1
, CONCAT(LEFT({0},1), REPEAT('*', INSTR({0}, '@')-2), SUBSTR({0}, INSTR({0}, '@')))
, CONCAT(LEFT({0},{1}), REPEAT('*', INSTR({0}, '@')-1-2*{1}), SUBSTR({0}, INSTR({0}, '@')-{1}))
) as `{0}_masked`
""".format(col_name, N))
df.select('*', mask_email('email_col', 2)).show()
#+-------------------+-------------------+
#| email_col| email_col_masked|
#+-------------------+-------------------+
#| [email protected]| ab**[email protected]|
#|[email protected]|12*****[email protected]|
#| [email protected]| a**@gmail.com|
#+-------------------+-------------------+
Upvotes: 0
Reputation: 43494
I think you can achieve this with the help of following regular expression: (?<=.{2})\w+(?=.{2}@)
(?<=.{2})
: Positive lookbehind for 2 characters\w+
: Any word characters(?=.{2}@)
: Positive lookahead for 2 characters followed by a literal @
First use regexp_extract
to extract this pattern from your string.
from pyspark.sql.functions import regexp_extract, regexp_replace
df = df.withColumn(
"pattern",
regexp_extract("email", r"(?<=.{2})\w+(?=.{2}@)", 0)
)
df.show()
#+-------------------+-------+
#| email|pattern|
#+-------------------+-------+
#| [email protected]| c1|
#|[email protected]| 3abc1|
#| [email protected]| |
#+-------------------+-------+
Then use regexp_replace
to create a replacement of *
of the same length.
df = df.withColumn(
"replacement",
regexp_replace("pattern", r"\w", "*")
)
df.show()
#+-------------------+-------+-----------+
#| email|pattern|replacement|
#+-------------------+-------+-----------+
#| [email protected]| c1| **|
#|[email protected]| 3abc1| *****|
#| [email protected]| | |
#+-------------------+-------+-----------+
Next use regexp_replace
again on the original email
column using the derived pattern
and replacement
columns.
To be safe, concat
the lookbehind/lookaheads from the original pattern when doing the replacment. To do this, we will have to use expr
in order to pass the column values as parameters.
from pyspark.sql.functions import concat, expr, lit
df = df.withColumn(
"mod_email_col",
expr("regexp_replace(email, concat('(?<=.{2})', pattern, '(?=.{2}@)'), replacement)")
)
df.show()
#+-------------------+-------+-----------+-------------------+
#| email|pattern|replacement| mod_email_col|
#+-------------------+-------+-----------+-------------------+
#| [email protected]| c1| **| ab**[email protected]|
#|[email protected]| 3abc1| *****|12*****[email protected]|
#| [email protected]| | | [email protected]|
#+-------------------+-------+-----------+-------------------+
Finally drop the intermediate columns:
df = df.drop("pattern", "replacement")
df.show()
#+-------------------+-------------------+
#| email| mod_email_col|
#+-------------------+-------------------+
#| [email protected]| ab**[email protected]|
#|[email protected]|12*****[email protected]|
#| [email protected]| [email protected]|
#+-------------------+-------------------+
Note: I added one test case to show that this does nothing if the email address part is 4 characters or less.
Update: Here are some ways you can handle edge cases where the email address part is less than 4 characters.
The rules I am using:
*
@
Code:
patA = "regexp_replace(email, concat('(?<=.{2})', pattern, '(?=.{2}@)'), replacement)"
patB = "regexp_replace(email, concat('(?<=.{1})', pattern, '(?=.{1}@)'), replacement)"
from pyspark.sql.functions import regexp_extract, regexp_replace
from pyspark.sql.functions import concat, expr, length, lit, split, when
df.withColumn("address_part", split("email", "@").getItem(0))\
.withColumn(
"pattern",
when(
length("address_part") > 5,
regexp_extract("email", r"(?<=.{2})\w+(?=.{2}@)", 0)
).otherwise(
regexp_extract("email", r"(?<=.{1})\w+(?=.{1}@)", 0)
)
).withColumn(
"replacement", regexp_replace("pattern", r"\w", "*")
).withColumn(
"mod_email_col",
when(
length("address_part") > 5, expr(patA)
).when(
length("address_part") > 3, expr(patB)
).otherwise(regexp_replace('email', '\w(?=@)', '*'))
).drop("pattern", "replacement", "address_part").show()
#+-------------------+-------------------+
#| email| mod_email_col|
#+-------------------+-------------------+
#| [email protected]| ab**[email protected]|
#|[email protected]|12*****[email protected]|
#| [email protected]| a***[email protected]|
#| [email protected]| a**[email protected]|
#| [email protected]| a*@test.com|
#| [email protected]| *@test.com|
#+-------------------+-------------------+
Upvotes: 2