Reputation: 24366
How to split string column into array of characters?
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# | Vilnius|
# | Riga|
# | Tallinn|
# | New York|
# +----------+
Desired output:
# +----------+------------------------+
# |col_cities|split |
# +----------+------------------------+
# |Vilnius |[V, i, l, n, i, u, s] |
# |Riga |[R, i, g, a] |
# |Tallinn |[T, a, l, l, i, n, n] |
# |New York |[N, e, w, , Y, o, r, k]|
# +----------+------------------------+
Upvotes: 1
Views: 3852
Reputation: 24366
split can be used with an empty string '' as the separator. However, it returns an empty string as the last element of the resulting array, so slice is needed to remove that trailing element.
split = "split(col_cities, '')"                        # SQL expression: array of chars plus a trailing ''
split = F.expr(f'slice({split}, 1, size({split})-1)')  # drop the trailing empty element
df.withColumn('split', split).show(truncate=0)
# +----------+------------------------+
# |col_cities|split |
# +----------+------------------------+
# |Vilnius |[V, i, l, n, i, u, s] |
# |Riga |[R, i, g, a] |
# |Tallinn |[T, a, l, l, i, n, n] |
# |New York |[N, e, w, , Y, o, r, k]|
# +----------+------------------------+
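The trailing-empty-element behaviour can be illustrated in plain Python (a hypothetical sanity check, not Spark itself). Note that Python's re.split on a zero-width pattern keeps empty strings at both edges, while Spark's Java-based split leaves only a trailing one; either way the empties must be stripped, which is what the slice(...) above does on the Spark side.

```python
import re

# Splitting on an empty pattern (Python 3.7+) yields empty strings at the
# edges, analogous to the trailing '' that Spark's split(col, '') produces.
raw = re.split('', 'Riga')
chars = [c for c in raw if c != '']  # strip the empty edge elements
print(chars)
```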
Upvotes: 2
Reputation: 71689
You can use split with a regex pattern containing a negative lookahead, which matches at every position except the end of the string, so no trailing empty element is produced:
df.withColumn('split', F.split('col_cities', '(?!$)')).show(truncate=0)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
Upvotes: 5