HelloScala
HelloScala

Reputation: 21

How to remove dots and spaces in initials of a name

I have a pyspark dataframe with names like:

Some of them contain dots and spaces between initials and some do not. How can they be converted to:

(with no dots and spaces between initials and 1 space between initials and name)

I tried using the following but it only replaces dots and doesn't remove spaces between initials:

names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.",""))

Thanks!

Upvotes: 2

Views: 1624

Answers (1)

mck
mck

Reputation: 42392

Some further manipulations can be done using transform. The idea is to split the name using spaces, and recombine them by only adding spaces between strings that are longer than 1 char, such that initials will be combined without spaces.

import pyspark.sql.functions as F

names_modified = names.withColumn(
    'name_clean',
    F.expr("""
        concat_ws('',
            transform(
                split(replace(name, '.', ''), ' '),
                (x, i) -> case when i = 0 then x
                               when length(x) > 1 then concat(' ', x)
                               else x
                          end
            )
        )
    """)
)

names_modified.show()
+------------------+-----------------+
|              name|       name_clean|
+------------------+-----------------+
|        J.J. Scott|         JJ Scott|
|       J. S. Joyce|         JS Joyce|
|RV. Bradley Carter|RV Bradley Carter|
+------------------+-----------------+

Upvotes: 1

Related Questions