Reputation: 230

Pyspark remove first element of array

I split a column on underscores, and now I want to remove the first element from the resulting array. The first element differs from row to row, so I can't remove it by matching a fixed value.

Column
abc1_food_1_3
abc2_drink_2_6
abc4_2

split(df.Column, '_').alias('Split_Column')

Split_Column
[abc1, food, 1, 3]
[abc2, drink, 2, 6]
[abc4, 2]

Now, how can I get:

Split_Column
[food, 1, 3]
[drink, 2, 6]
[2]

Afterwards I will convert the array column back to an underscore-delimited string (with concat_ws, I believe).

Upvotes: 1

Answers (4)

s.polam

Reputation: 10372

You can also try the code below.

expr("filter(Split_Column, (x, i) -> i != 0)").alias("Split_Column")  # i is the zero-based index of each element
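To make the two-argument lambda concrete, here is a plain-Python model of what that filter does to each row's array. It only illustrates the semantics and is not Spark code; `drop_first_by_index` is a hypothetical helper name. In Spark itself, the index argument to `filter` is, as far as I know, only accepted from Spark 3.0 onward, so check your version.

```python
def drop_first_by_index(arr):
    # Mirrors filter(arr, (x, i) -> i != 0): keep every element whose
    # zero-based position i is not 0, regardless of its value.
    return [x for i, x in enumerate(arr) if i != 0]


rows = [
    ["abc1", "food", "1", "3"],
    ["abc2", "drink", "2", "6"],
    ["abc4", "2"],
]
result = [drop_first_by_index(r) for r in rows]
print(result)
```

Because the test is on the position and not on the value, later elements that happen to equal the first one are kept.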

Upvotes: 1

mck

Reputation: 42352

If you simply want to remove the string before the first underscore, you can do:

df.selectExpr('substring_index(Column, "_", -size(split(Column, "_")) + 1)')

Example:

df = spark.createDataFrame([['abc1_food_1_3'],['abc2_drink_2_6'],['abc4_2']]).toDF('Column')
df.show()
+--------------+
|        Column|
+--------------+
| abc1_food_1_3|
|abc2_drink_2_6|
|        abc4_2|
+--------------+

df = df.selectExpr('substring_index(Column, "_", -size(split(Column, "_"))+1) as trimmed')
df.show()
+---------+
|  trimmed|
+---------+
| food_1_3|
|drink_2_6|
|        2|
+---------+
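The count `-size(split(Column, "_")) + 1` can look cryptic, so here is a plain-Python sketch of the arithmetic. The `substring_index` function below is a hand-written model of the Spark SQL function (not the real one), `drop_first_token` is a hypothetical helper name, and the model assumes a non-empty, literal delimiter.

```python
def substring_index(s, delim, count):
    # Model of Spark SQL's substring_index:
    #   count > 0 -> everything left of the count-th delimiter
    #   count < 0 -> everything right of the count-th delimiter from the right
    #   count == 0 -> empty string
    parts = s.split(delim)
    if count == 0:
        return ""
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


def drop_first_token(s, delim="_"):
    n = len(s.split(delim))  # plays the role of size(split(Column, "_"))
    # A string with n tokens has n - 1 delimiters; asking for the last
    # n - 1 tokens (count = -(n - 1)) drops exactly the first one.
    return substring_index(s, delim, -n + 1)


values = ["abc1_food_1_3", "abc2_drink_2_6", "abc4_2"]
trimmed = [drop_first_token(v) for v in values]
print(trimmed)
```

In this model a value with no underscore at all gives a count of 0 and therefore an empty string. I believe Spark behaves the same way, but it is worth checking on your own data if such rows can occur.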

Upvotes: 1

Aditya Vikram Singh

Reputation: 476

This might be helpful:

df = df.withColumn("Split_Column_PROCESSED", F.expr("slice(Split_Column, 2, SIZE(Split_Column))"))

Here is a snippet that uses it.

Its performance might be better.

>>> df.printSchema()
root
 |-- COLA: array (nullable = true)
 |    |-- element: long (containsNull = true)
>>> df.show()
+--------------------+
|                COLA|
+--------------------+
|        [1, 2, 4, 5]|
|[3, 57, 29, 34, 494]|
+--------------------+


>>> import pyspark.sql.functions as F
>>> df = df.withColumn("FINAL", F.expr("slice(COLA, 2, SIZE(COLA))"))
>>> df.show()
+--------------------+-----------------+
|                COLA|            FINAL|
+--------------------+-----------------+
|        [1, 2, 4, 5]|        [2, 4, 5]|
|[3, 57, 29, 34, 494]|[57, 29, 34, 494]|
+--------------------+-----------------+
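Since `slice` takes a 1-based start and a length (not an end index), passing the full array size as the length asks for more elements than remain, which simply returns everything from position 2 onward. The sketch below models that in plain Python on the question's data and then rejoins with underscores, which is what `concat_ws('_', ...)` would do in Spark. `spark_slice` is a hypothetical stand-in for the SQL function and only covers a positive start.

```python
def spark_slice(arr, start, length):
    # Model of slice(arr, start, length) for start >= 1 only:
    # start is 1-based, and a length that runs past the end is truncated.
    if start < 1 or length < 0:
        raise ValueError("this sketch only models start >= 1 and length >= 0")
    begin = start - 1
    return arr[begin:begin + length]


rows = [
    ["abc1", "food", "1", "3"],
    ["abc2", "drink", "2", "6"],
    ["abc4", "2"],
    ["solo"],  # extra row, not in the question: a value with no underscore
]
trimmed = [spark_slice(r, 2, len(r)) for r in rows]
rejoined = ["_".join(t) for t in trimmed]  # analogue of concat_ws('_', col)
print(trimmed)
print(rejoined)
```

The single-element row ends up as an empty array and then an empty string in this model. I would expect Spark to do the same, but verify it if that case matters to you.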

Upvotes: 4

pdangelo4

Reputation: 230

Of course, after asking this I found a solution:

expr("filter(Split_Column, x -> not(x <=> Split_Column[0]))").alias('Split_Column')

Is there another way this can be done, perhaps by coupling array_remove and element_at?
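One caveat with any value-based approach, whether the filter above or `array_remove` combined with `element_at`: it removes every element equal to the first one, not just the element at the first position (`array_remove` is documented as removing all matching elements). A plain-Python comparison, with hypothetical helper names and a made-up row that repeats its first token, shows the difference:

```python
def drop_by_value(arr):
    # Models filter(arr, x -> not(x <=> arr[0])): drops ALL elements
    # equal to the first one.
    if not arr:
        return []
    first = arr[0]
    return [x for x in arr if x != first]


def drop_by_position(arr):
    # Models the index- or slice-based answers: drops only position 0.
    return arr[1:]


row = ["abc1", "food", "abc1", "3"]  # made-up row where the first token repeats
by_value = drop_by_value(row)
by_position = drop_by_position(row)
print(by_value)
print(by_position)
```

If the first token can never reappear later in the string, both approaches give the same result; otherwise a position-based one is safer.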

Upvotes: 2
