bakun
bakun

Reputation: 475

Apply function to all elements in array<string> column

I have a column with arrays of strings, e.g. like this:

["test.a" "random.ac"]
["test.41" "random.23" "test.123"]

I want to get only the text before the ".". I did it only for the fist element of the array. How do I do it to all elements? Preferably without UDFs.

df = df.withColumn("address", substring_index(col("all_addresses").getItem(0), ".", 1)))

Upvotes: 0

Views: 1581

Answers (2)

ZygD
ZygD

Reputation: 24488

I would use a similar idea as @wwnde - transform function. transform takes an array, transforms every its element according to the provided function and results in the array of the same size, but with changed elements. Exactly what you need.

However, having the same original idea, I would probably implement it differently.
2 options:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(["test.a", "random.ac"],),
     (["test.41", "random.23", "test.123"],)],
    ['c1']
)

df = df.withColumn('c2', F.transform('c1', lambda x: F.element_at(F.split(x, '\.'), 1)))
df = df.withColumn('c3', F.transform('c1', lambda x: F.regexp_extract(x, r'(.+)\.', 1)))

df.show()
# +--------------------+--------------------+--------------------+
# |                  c1|                  c2|                  c3|
# +--------------------+--------------------+--------------------+
# | [test.a, random.ac]|      [test, random]|      [test, random]|
# |[test.41, random....|[test, random, test]|[test, random, test]|
# +--------------------+--------------------+--------------------+

Upvotes: 1

wwnde
wwnde

Reputation: 26676

Use regexp_extract. extra alphanumerics matches if only followed by special character .

+---+------------------------------+
|id |text                          |
+---+------------------------------+
|1  |[test.a, random.ac]           |
|2  |[test.41, random.23, test.123]|
+---+------------------------------+



 df.withColumn('new', expr("transform(text,x->regexp_extract(x,'[a-z0-9]+(?=\.)',0))")).show()

+---+--------------------+--------------------+
| id|                text|                 new|
+---+--------------------+--------------------+
|  1| [test.a, random.ac]|      [test, random]|
|  2|[test.41, random....|[test, random, test]|
+---+--------------------+--------------------+

Upvotes: 1

Related Questions