Reputation: 230
I split a column on underscores, and now I want to remove the first element from the resulting array. The first element is different in every row, so I can't remove it by matching a fixed value.
Column
abc1_food_1_3
abc2_drink_2_6
abc4_2
split(df.Column, '_').alias('Split_Column')
Split_Column
[abc1, food, 1, 3]
[abc2, drink, 2, 6]
[abc4, 2]
Now, how can I get:
Split_Column
[food, 1, 3]
[drink, 2, 6]
[2]
I will be converting the array column back to an underscore-delimited string afterwards (with concat_ws, I believe).
Upvotes: 1
Views: 5140
Reputation: 10372
You can also try the code below. Here i is the zero-based index of each array element (the two-argument lambda needs Spark 3.0 or later).
expr("filter(Split_Column, (x, i) -> i != 0)").alias("Split_Column")
Upvotes: 1
Reputation: 42352
If you simply want to remove the string before the first underscore, you can do:
df.selectExpr('substring_index(Column, "_", -size(split(Column, "_")) + 1)')
Example:
df = spark.createDataFrame([['abc1_food_1_3'],['abc2_drink_2_6'],['abc4_2']]).toDF('Column')
df.show()
+--------------+
|        Column|
+--------------+
| abc1_food_1_3|
|abc2_drink_2_6|
|        abc4_2|
+--------------+
df = df.selectExpr('substring_index(Column, "_", -size(split(Column, "_"))+1) as trimmed')
df.show()
+---------+
|  trimmed|
+---------+
| food_1_3|
|drink_2_6|
|        2|
+---------+
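To see why the negative count works: with a negative count, substring_index keeps everything to the right of the count-th delimiter counted from the end. A string with n tokens has its last n - 1 tokens kept, which drops exactly the first one. Below is a rough plain-Python model of that logic. It is only an illustration, not Spark's implementation; both helper functions are hypothetical names, and Python's str.split is standing in for Spark's regex-based split.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Rough plain-Python model of SQL substring_index (illustration only)."""
    parts = s.split(delim)
    if count > 0:
        # Keep everything left of the count-th delimiter from the start.
        return delim.join(parts[:count])
    if count < 0:
        # Keep everything right of the count-th delimiter from the end.
        return delim.join(parts[count:])
    return ""


def drop_first_token(s: str, delim: str = "_") -> str:
    # Mirrors: substring_index(Column, "_", -size(split(Column, "_")) + 1)
    n_tokens = len(s.split(delim))
    return substring_index(s, delim, -n_tokens + 1)


for value in ["abc1_food_1_3", "abc2_drink_2_6", "abc4_2"]:
    print(value, "->", drop_first_token(value))
```

Note that a value with no underscore at all gives a count of 0, so the result is an empty string.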
Upvotes: 1
Reputation: 476
This might be helpful:
df = df.withColumn("Split_Column_PROCESSED", F.expr("slice(Split_Column, 2, SIZE(Split_Column))"))
I am adding a snippet that uses it.
Its performance might be better.
>>> df.printSchema()
root
 |-- COLA: array (nullable = true)
 |    |-- element: long (containsNull = true)
>>> df.show()
+--------------------+
|                COLA|
+--------------------+
|        [1, 2, 4, 5]|
|[3, 57, 29, 34, 494]|
+--------------------+
>>> import pyspark.sql.functions as F
>>> df = df.withColumn("FINAL", F.expr("slice(COLA, 2, SIZE(COLA))"))
>>> df.show()
+--------------------+-----------------+
|                COLA|            FINAL|
+--------------------+-----------------+
|        [1, 2, 4, 5]|        [2, 4, 5]|
|[3, 57, 29, 34, 494]|[57, 29, 34, 494]|
+--------------------+-----------------+
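For anyone puzzled by the arguments: slice takes a 1-based start position and a length, so slice(COLA, 2, SIZE(COLA)) starts at the second element and asks for more elements than remain, which just returns everything up to the end (as the output above shows). A minimal plain-Python sketch of that behaviour follows. It only models positive start values, and sql_slice is a hypothetical helper name, not a Spark API.

```python
def sql_slice(arr, start, length):
    """Plain-Python model of slice(arr, start, length) for positive start only."""
    if start < 1:
        raise ValueError("this sketch only models a positive, 1-based start")
    begin = start - 1  # convert the 1-based position to a 0-based index
    # Python slicing stops at the end of the list if length overshoots.
    return arr[begin:begin + length]


cola = [1, 2, 4, 5]
print(sql_slice(cola, 2, len(cola)))

# Same idea on the question's data, then joined back with "_"
# (the role concat_ws would play on the Spark side).
tokens = "abc1_food_1_3".split("_")
print("_".join(sql_slice(tokens, 2, len(tokens))))
```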
Upvotes: 4
Reputation: 230
Of course, after asking this I found a solution:
expr("filter(Split_Column, x -> not(x <=> Split_Column[0]))").alias('Split_Column')
Is there another way to do this, perhaps by combining array_remove and element_at?
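One caveat worth flagging with any value-based approach: filtering on equality with Split_Column[0] removes every element equal to the first one, not only the element at position 0. array_remove combined with element_at(Split_Column, 1) would behave the same way, since array_remove also drops all matching elements. The plain-Python sketch below shows the difference on a made-up row (the value "2_food_2" and both function names are assumptions for illustration, and null handling is ignored).

```python
def drop_by_value(tokens):
    """Models filtering out everything equal to the first element."""
    if not tokens:
        return []
    first = tokens[0]
    return [t for t in tokens if t != first]


def drop_by_position(tokens):
    """Models dropping only the element at the first position."""
    return tokens[1:]


row = "2_food_2".split("_")          # ['2', 'food', '2']
print(drop_by_value(row))            # the trailing '2' is removed as well
print(drop_by_position(row))         # only the leading '2' is removed
```

If repeated tokens can occur in the data, a position-based approach (slice, or filter with an index) looks like the safer choice.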
Upvotes: 2