John Doe

Reputation: 10193

Get second last word from string using PySpark

I need to get the second-to-last word from a string column.

df = spark.createDataFrame([
  ["sample text 1 AFTEDGH XX"],
  ["sample text 2 GDHDH ZZ"],
  ["sample text 3 JEYHEHH YY"],
  ["sample text 4 QPRYRT EB"],
  ["sample text 5 KENBFBF XX"]
]).toDF("line")

Expected output:

+--------+
|word    |
+--------+
|AFTEDGH |
|GDHDH   |
|JEYHEHH |
|QPRYRT  |
|KENBFBF |
+--------+

I tried:

df_new = df.withColumn('word', F.split(F.col('line'), ' ')[-2])

df_new = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[-2])

But both return null.

Upvotes: 2

Views: 1518

Answers (2)

blackbishop

Reputation: 32650

You can also use the substring_index function:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.substring_index(F.substring_index("line", " ", -2), " ", 1)
)

df1.show(truncate=False)

#+------------------------+-------+
#|line                    |word   |
#+------------------------+-------+
#|sample text 1 AFTEDGH XX|AFTEDGH|
#|sample text 2 GDHDH ZZ  |GDHDH  |
#|sample text 3 JEYHEHH YY|JEYHEHH|
#|sample text 4 QPRYRT EB |QPRYRT |
#|sample text 5 KENBFBF XX|KENBFBF|
#+------------------------+-------+
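
The inner substring_index("line", " ", -2) keeps everything after the second-to-last space, i.e. the last two words, and the outer call with count 1 then takes the first of those two. A minimal sketch of the intermediate step (the last_two alias is only illustrative):

df.select(
    # last two space-separated tokens, e.g. 'AFTEDGH XX'
    F.substring_index("line", " ", -2).alias("last_two"),
    # first token of those two, e.g. 'AFTEDGH'
    F.substring_index(F.substring_index("line", " ", -2), " ", 1).alias("word")
).show(truncate=False)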

Upvotes: 2

mck

Reputation: 42352

To use negative indices, you can use element_at:

import pyspark.sql.functions as F

df2 = df.withColumn('word', F.element_at(F.split(F.col('line'), ' '), -2))

df2.show(truncate=False)
+------------------------+-------+
|line                    |word   |
+------------------------+-------+
|sample text 1 AFTEDGH XX|AFTEDGH|
|sample text 2 GDHDH ZZ  |GDHDH  |
|sample text 3 JEYHEHH YY|JEYHEHH|
|sample text 4 QPRYRT EB |QPRYRT |
|sample text 5 KENBFBF XX|KENBFBF|
+------------------------+-------+

Your second attempt is almost correct: since the array is already reversed, use a positive index, and remember that the [] accessor is 0-based, so the second element is at index 1:

df2 = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[1])
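
Both of your attempts return null because the [] accessor (Column.getItem) only understands non-negative, 0-based positions, so a negative index is simply out of range; element_at is 1-based and counts from the end when given a negative position (available since Spark 2.4). A small sketch to illustrate the difference (the demo DataFrame is purely illustrative):

import pyspark.sql.functions as F

demo = spark.range(1).select(F.split(F.lit("a b c d"), " ").alias("arr"))

demo.select(
    F.element_at("arr", 1).alias("first"),          # element_at is 1-based -> 'a'
    F.element_at("arr", -2).alias("second_last"),   # negative counts from the end -> 'c'
    F.col("arr")[1].alias("getitem_1"),             # [] / getItem is 0-based -> 'b'
    F.col("arr")[-2].alias("getitem_minus_2")       # negative index -> null (default, non-ANSI mode)
).show()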

Upvotes: 5
