Reputation: 311
I have a pyspark dataframe with a Name column with sample values as follows:
id NAME
---+-------
1 aaa bb c
2 xx yy z
3 abc def
4 qw er
5 jon lee ls G
I have to flip the right most part and populate it on the left side with comma and delete the right most substring(split using the space)
Expected output
id NAME
---+-------
1 c, aaa bb
2 z, xx yy
3 def, abc
4 er, qw
5 G, jon lee ls
I was able to get the right most part to add it with comma by using below code: split_col=split(df['NAME'],' ') df2 = df2.withColumn('NAME_RIGHT',split_col.getItem(F.size(split_col) - 1))
the above line gives
NAME_RIGHT
c
z
def
er
I want to replace the values in NAME_RIGHT i.e. the right most values from the NAME column, I tried using below code but it replaces nothing, how can this be achieved?
df3 = df2.withColumn('NEW_NAME', regexp_replace(F.col("NAME"), str(df2.NAME_RIGHT),""))
Upvotes: 0
Views: 648
Reputation: 6082
Regex would be a bit cumbersome, I'd suggest to use split
and concat
instead.
from pyspark.sql import functions as F
(df
.withColumn('n1', F.split('name', ' '))
.withColumn('n2', F.reverse('n1')[0])
.withColumn('n3', F.concat_ws(' ', F.array_except('n1', F.array('n2'))))
.withColumn('n4', F.concat_ws(', ', F.array('n2', 'n3')))
.show()
)
# +---+------------+-----------------+---+----------+-------------+
# | id| name| n1| n2| n3| n4|
# +---+------------+-----------------+---+----------+-------------+
# | 1| aaa bb c| [aaa, bb, c]| c| aaa bb| c, aaa bb|
# | 2| xx yy z| [xx, yy, z]| z| xx yy| z, xx yy|
# | 3| abc def | [abc, def, ]| | abc def| , abc def|
# | 4| qw er| [qw, er]| er| qw| er, qw|
# | 5|jon lee ls G|[jon, lee, ls, G]| G|jon lee ls|G, jon lee ls|
# +---+------------+-----------------+---+----------+-------------+
Upvotes: 1