ZygD

Reputation: 24366

Split string to array of characters in Spark

How can I split a string column into an array of characters?

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# |   Vilnius|
# |      Riga|
# |   Tallinn|
# |  New York|
# +----------+

Desired output:

# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+

Upvotes: 1

Views: 3852

Answers (2)

ZygD

Reputation: 24366

split can be used with an empty string '' as the separator. However, it then returns an empty string as the last element of the array, so slice is needed to remove that trailing element.

split = "split(col_cities, '')"
split = F.expr(f'slice({split}, 1, size({split})-1)')

df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
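
For comparison, the same idea can be expressed with column functions instead of a SQL string. This is a minimal sketch, assuming Spark 3.1+, where F.slice accepts Column arguments for start and length:

from pyspark.sql import functions as F

# Split on the empty string, then slice off the trailing empty element
# (Column-typed start/length arguments to slice require Spark 3.1+).
chars = F.split('col_cities', '')
df.withColumn('split', F.slice(chars, F.lit(1), F.size(chars) - 1)).show(truncate=0)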

Upvotes: 2

Shubham Sharma

Reputation: 71689

You can use split with a regex pattern containing a negative lookahead. The pattern (?!$) matches at every position except the end of the string, so no trailing empty string is produced:

df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)

# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
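
As a quick sanity check (an illustrative addition, not part of the answer), concat_ws can join the characters back together, confirming the split is lossless, including the space in 'New York'. Here 'rebuilt' is a hypothetical column name used only for this check:

from pyspark.sql import functions as F

# Round trip: split into characters, then join back with an empty separator.
# 'rebuilt' is a hypothetical column name for this illustration.
(df.withColumn('split', F.split('col_cities', '(?!$)'))
   .withColumn('rebuilt', F.concat_ws('', 'split'))
   .show(truncate=0))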

Upvotes: 5
