Reputation: 10193
I need to get the second last word from a string value.
df = spark.createDataFrame([
["sample text 1 AFTEDGH XX"],
["sample text 2 GDHDH ZZ"],
["sample text 3 JEYHEHH YY"],
["sample text 4 QPRYRT EB"],
["sample text 5 KENBFBF XX"]
]).toDF("line")
+--------+
|word |
+--------+
|AFTEDGH |
|GDHDH |
|JEYHEHH |
|QPRYRT |
|KENBFBF |
+--------+
I tried:
df_new = df.withColumn('word', F.split(F.col('line'), ' ')[-2])
df_new = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[-2])
But they return Null
Upvotes: 2
Views: 1518
Reputation: 32650
You can also use substring_index
function :
from pyspark.sql import functions as F
df1 = df.withColumn(
"word",
F.substring_index(F.substring_index("line", " ", -2), " ", 1)
)
df1.show(truncate=False)
#+------------------------+-------+
#|line |word |
#+------------------------+-------+
#|sample text 1 AFTEDGH XX|AFTEDGH|
#|sample text 2 GDHDH ZZ |GDHDH |
#|sample text 3 JEYHEHH YY|JEYHEHH|
#|sample text 4 QPRYRT EB |QPRYRT |
#|sample text 5 KENBFBF XX|KENBFBF|
#+------------------------+-------+
Upvotes: 2
Reputation: 42352
To use negative indices, you can use element_at
:
import pyspark.sql.functions as F
df2 = df.withColumn('word', F.element_at(F.split(F.col('line'), ' '), -2))
df2.show(truncate=False)
+------------------------+-------+
|line |word |
+------------------------+-------+
|sample text 1 AFTEDGH XX|AFTEDGH|
|sample text 2 GDHDH ZZ |GDHDH |
|sample text 3 JEYHEHH YY|JEYHEHH|
|sample text 4 QPRYRT EB |QPRYRT |
|sample text 5 KENBFBF XX|KENBFBF|
+------------------------+-------+
Your second attempt is almost correct - just use a positive index because you have reversed the array already, and remember to subtract 1 from the index as well:
df2 = df.withColumn('word', F.reverse(F.split(F.col('line'), ' '))[1])
Upvotes: 5