Reputation: 21
I have a pyspark dataframe with names like:
Some of them contain dots and spaces between initials and some do not. How can they be converted to:
(with no dots and spaces between initials and 1 space between initials and name)
I tried using the following but it only replaces dots and doesn't remove spaces between initials:
names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.",""))
Thanks!
Upvotes: 2
Views: 1624
Reputation: 42392
Some further manipulations can be done using transform
. The idea is to split the name using spaces, and recombine them by only adding spaces between strings that are longer than 1 char, such that initials will be combined without spaces.
import pyspark.sql.functions as F
names_modified = names.withColumn(
'name_clean',
F.expr("""
concat_ws('',
transform(
split(replace(name, '.', ''), ' '),
(x, i) -> case when i = 0 then x
when length(x) > 1 then concat(' ', x)
else x
end
)
)
""")
)
names_modified.show()
+------------------+-----------------+
| name| name_clean|
+------------------+-----------------+
| J.J. Scott| JJ Scott|
| J. S. Joyce| JS Joyce|
|RV. Bradley Carter|RV Bradley Carter|
+------------------+-----------------+
Upvotes: 1