dataframe apache-spark pyspark apache-spark-sql regexp-replace

HelloScala

Reputation: 21

How to remove dots and spaces in initials of a name

I have a pyspark dataframe with names like:

J.J. Scott
J. S. Joyce
RV. Bradley Carter

Some of them contain dots and spaces between initials and some do not. How can they be converted to:

JJ Scott
JS Joyce
RV Bradley Carter

(with no dots or spaces between the initials, and a single space between the initials and the name)

I tried the following, but it only removes the dots and leaves the spaces between the initials:

from pyspark.sql.functions import regexp_replace

names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.", ""))

Thanks!

Upvotes: 2

Answers (1)

mck

Reputation: 42392

The remaining cleanup can be done with the Spark SQL transform function. The idea is to remove the dots, split the name on spaces, and rejoin the pieces, adding a leading space only before pieces that are longer than 1 character, so that single-letter initials are joined without spaces.

import pyspark.sql.functions as F

names_modified = names.withColumn(
    'name_clean',
    F.expr("""
        concat_ws('',
            transform(
                split(replace(name, '.', ''), ' '),
                (x, i) -> case when i = 0 then x
                               when length(x) > 1 then concat(' ', x)
                               else x
                          end
            )
        )
    """)
)

names_modified.show()
+------------------+-----------------+
|              name|       name_clean|
+------------------+-----------------+
|        J.J. Scott|         JJ Scott|
|       J. S. Joyce|         JS Joyce|
|RV. Bradley Carter|RV Bradley Carter|
+------------------+-----------------+
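
To make the rule easier to check without a Spark session, here is a minimal plain-Python sketch of the same split-and-rejoin logic. The function name clean_name is hypothetical and only for illustration; it is meant to mirror the expression above, not replace it, and in Spark the built-in expression is usually preferable to wrapping something like this in a UDF.

```python
def clean_name(name: str) -> str:
    """Plain-Python mirror of the split / transform / concat_ws expression.

    clean_name is a hypothetical helper used only to illustrate the rule.
    """
    # Drop the dots, then split on single spaces (like split(..., ' ')).
    tokens = name.replace(".", "").split(" ")
    parts = []
    for i, token in enumerate(tokens):
        if i == 0 or len(token) <= 1:
            # First token, or a single-letter initial: no leading space.
            parts.append(token)
        else:
            # Longer tokens are treated as names and get a leading space.
            parts.append(" " + token)
    # Equivalent of concat_ws('', ...): join with no separator.
    return "".join(parts)


for raw in ["J.J. Scott", "J. S. Joyce", "RV. Bradley Carter"]:
    print(raw, "->", clean_name(raw))
```

Note that, as in the Spark expression, any single-character token after the first is attached to whatever precedes it, so a one-letter word that is not an initial would also be joined without a space.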

Upvotes: 1