Reputation: 13
In one of my projects, I need to transform a string column whose values look like this:
"[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"
"[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz"
I only need the alphanumeric value after "ALL/" and before "_ID", so the first record should yield "abc12345" and the second "abc12".
In PySpark, I am using substring inside withColumn to get the 8 characters after the "ALL/" position, which gives me "abc12345" and "abc12_ID". Then I am using regexp_replace inside another withColumn: if the value matches rlike "_ID$", I replace "_ID" with "", otherwise I keep the column value unchanged. This gives the expected results: "abc12345" and "abc12".
But is there a better solution for this?
Upvotes: 1
Views: 5647
Reputation: 24386
Maybe like this, with a single regexp_extract?
F.regexp_extract('col_name', r'ALL\/([^\W_]+)', 1)
Test:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz",),
     ("[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz",)],
    ["col_name"])
df = df.withColumn("col2", F.regexp_extract("col_name", r"ALL\/([^\W_]+)", 1))
df.show(truncate=0)
# +------------------------------------------------------+--------+
# |col_name |col2 |
# +------------------------------------------------------+--------+
# |[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz|abc12345|
# |[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz |abc12 |
# +------------------------------------------------------+--------+
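A note on the pattern: [^\W_] means "a word character that is not an underscore", i.e. letters and digits, so the capture group stops right before "_ID". The same pattern behaves the same way in Python's re module, which makes it easy to sanity-check outside Spark:

```python
import re

s = "[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"

# Capture the run of letters/digits immediately after "ALL/";
# [^\W_] excludes the underscore from the \w class.
m = re.search(r"ALL/([^\W_]+)", s)
print(m.group(1))  # abc12345
```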
Upvotes: 3