Reputation: 225
I have a PySpark DataFrame column with names:
| name         |
|--------------|
| Lebron James |
| Kyrie Irving |
| Kevin Durant |
I want to create a new column such as the following:
| name         | trigram                   |
|--------------|---------------------------|
| Lebron James | Leb ebr bro on Jam ame es |
| Kyrie Irving | ...                       |
| Kevin Durant | ...                       |
So far I have:
from pyspark.sql.functions import col, regexp_replace
df.withColumn("trigram", regexp_replace(col("name"), r"([A-Za-z0-9\s]{3})(?!$)", r"$1 "))
But this outputs:
| name         | trigram        |
|--------------|----------------|
| Lebron James | Leb ron Ja mes |
| Kyrie Irving | Kyr ie Irv ing |
| Kevin Durant | Kev in Dur ant |
Note: It is important NOT to use UDFs. I could simply do what I want with a UDF and a list comprehension, but I'm looking for the most efficient way to do this, since the actual data has hundreds of millions of rows.
Upvotes: 2
Views: 164
Reputation: 627103
You can use:
regexp_replace(col("name"), "(?=(.{3})).", r"$1 ")
See the regex demo. Details:
- (?=(.{3})) - a positive lookahead that captures (into Group 1, $1) the three chars other than line break chars immediately to the right of the current location.
- . - any char but a line break char, consumed (it is removed and replaced by the 3-char streak starting from this char).
Upvotes: 3
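To see what the two patterns do without starting a Spark session, here is a minimal sketch using Python's built-in `re` module. It assumes Python's regex engine behaves like the Java one Spark uses for these particular patterns; the only syntax change is that the Java-style `$1` replacement is written `\1` in Python. The function names `overlapping_trigrams` and `chunked` are hypothetical helpers, not part of PySpark.

```python
import re

# Spark's regexp_replace uses Java regex, where the group reference is "$1".
# In Python's re module the same reference is written r"\1".
OVERLAPPING = r"(?=(.{3}))."             # pattern from the answer
CONSUMING = r"([A-Za-z0-9\s]{3})(?!$)"   # pattern from the question


def overlapping_trigrams(name: str) -> str:
    """Replace each char with the 3-char window that starts at it, plus a space.

    Only one char is consumed per match, so the next match starts one
    position later and the windows overlap.
    """
    return re.sub(OVERLAPPING, r"\1 ", name)


def chunked(name: str) -> str:
    """The question's attempt: every match consumes 3 chars, so windows never overlap."""
    return re.sub(CONSUMING, r"\1 ", name)


print(overlapping_trigrams("Lebron"))  # Leb ebr bro ron on
print(chunked("Lebron"))               # Leb ron
```

On a full name such as "Lebron James", the overlapping pattern also emits the windows that span the space between the words ("on ", "n J", " Ja"), and the last two characters ("es") are left in place because no three-character window starts there. If those pieces are unwanted, they would need an extra cleanup step.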