Reputation: 43
I have a DataFrame
+------+--------------------+-----------------+----
| id| titulo |tipo | formacion |
+------+--------------------+-----------------+----
|32084|A | Material | VION00001 TRADE |
|32350|B | Curso | CUS11222 LEADER|
|32362|C | Curso | ITIN9876 EVALUA|
|32347|D | Curso | CUMPLI VION1234 |
|32036|E | Curso | EVAN1111 INFORM|
I need, that into formacion column remove the characters that start with VION|CUS|ITIN|VION|EVAN so Dataframe looks like
+------+--------------------+-----------------+----
| id| titulo |tipo | formacion |
+------+--------------------+-----------------+----
|32084|A | Material | TRADE |
|32350|B | Curso | LEADER |
|32362|C | Curso | EVALUA |
|32347|D | Curso | CUMPLI |
|32036|E | Curso | INFORM |
+------+--------------------+-----------------+----
Thank you for your help
Upvotes: 3
Views: 311
Reputation: 31510
Use split
function to split the column by space
then get the last element of array.
Spark2.4+
use element_at
functionSpark < 2.4
use reverse(split(array))[0]
#using element_at
df.withColumn("formacion",element_at(split(col("formacion"),"\\s"),-1)).show()
#or using array_index
df.withColumn("formacion",split(col("formacion"),"\\s")[1]).show()
#split reverse and get first index value
df.withColumn("formacion",reverse(split(col("formacion"),"\\s"))[0]).show()
#+-----+--------------+----------+-------------+
#| id|titulo |tipo | formacion |
#+------+--------------------+-----------------+
#|32084|A | Material | TRADE |
#|32350|B | Curso | LEADER |
#|32362|C | Curso | EVALUA |
#|32347|D | Curso | CUMPLI |
#|32036|E | Curso | INFORM |
#+-----+--------------+----------+-------------+
Upvotes: 2
Reputation: 43
Sorry guys , this is original Column from DataFrame
formacion = [ VION00001 TRADE, CUS11222 LEADER,ITIN9876 EVALUA ,VION1234 CUMPLI,EVAN11 FR]
This is expected
formacion = [ TRADE, LEADER,EVALUA ,CUMPLI, FR]
Upvotes: 0