Create N-Gram using Regular Expression in PySpark

Question

I have a PySpark DataFrame column with names:

|   name     |
--------------
|Lebron James|
|Kyrie Irving|
|Kevin Durant|

I want to create a new column such as the following:

|   name     |         trigram          |
-----------------------------------------
|Lebron James| Leb ebr bro on  Jam ame es
|Kyrie Irving| ...
|Kevin Durant| ...

So far I have:

df.withColumn("trigram", regexp_replace(col("name"), r"([A-Za-z0-9\s]{3})(?!$)", r"$1 "))

But this outputs:

|   name     |         trigram       |
--------------------------------------
|Lebron James| Leb ron Ja  mes
|Kyrie Irving| Kyr ie  Irv ing
|Kevin Durant| Kev in  Dur ant

Note: It is important NOT to use UDFs. I could easily do this with a UDF and a list comprehension, but I'm looking for the most efficient approach, since the actual data has hundreds of millions of rows.

Wiktor Stribiżew · Accepted Answer

You can use

regexp_replace(col("name"), "(?=(.{3})).", r"$1 ")

Details:

(?=(.{3})) - a positive lookahead that captures into Group 1 ($1) the three characters (other than line break characters) immediately to the right of the current location, without consuming them
. - any character other than a line break character; it is consumed and replaced by the three-character sequence that starts at it, followed by a space. Because each match consumes only one character, consecutive matches overlap.
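To see why the lookahead version overlaps while the pattern from the question does not, here is a minimal sketch in plain Python. It uses the standard `re` module rather than Spark, so the group reference is written `\1` instead of `$1`. The helper names `overlapping_trigrams` and `chunked` are made up for illustration, and the sketch assumes the two engines behave the same way for these simple patterns.

```python
import re


def overlapping_trigrams(name: str) -> str:
    # Pattern from the answer: the lookahead captures three characters,
    # but each match consumes only one, so the next match starts one
    # character later and the captured windows overlap.
    return re.sub(r"(?=(.{3})).", r"\1 ", name)


def chunked(name: str) -> str:
    # Pattern from the question: each match consumes all three characters,
    # so the next match can only start after them (no overlap).
    return re.sub(r"([A-Za-z0-9\s]{3})(?!$)", r"\1 ", name)


print(repr(overlapping_trigrams("Lebron James")))
# 'Leb ebr bro ron on  n J  Ja Jam ame mes es'
print(repr(chunked("Lebron James")))
# 'Leb ron  Ja mes'
```

Note that the last two characters ("es") remain at the end of the overlapping result: at those positions fewer than three characters are left, so the lookahead fails and they are copied through unchanged.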
