Amit Singh
Amit Singh

Reputation: 43

Pyspark dataframe split and pad delimited column value into Array of N index

There is a pyspark source dataframe having a column named X. The column X consists of '-' delimited values. There can be any number of delimited values in that particular column. Example of source dataframe given below:

X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123

Now, need to split this column with delimiter and pull exactly N=4 seperate delimited values. If there are more than 4 delimited values, then we need first 4 delimited values and discard the rest. If there are less than 4 delimited values, then we need to pick the existing ones and pad the rest with empty character "".

Resulting output should be like below:

X Col1 Col2 Col3 Col4
A123-B345-C44656-D4423-E3445-F5667 A123 B345 C44656 D4423
X123-Y345 A123 Y345
Z123-N345-T44656-M4423 Z123 N345 T44656 M4423
X123 X123

Have easily accomplished this in python as per below code, but thinking of pyspark approach to do this:

    def pad_infinite(siterable, padding=None):
        return chain(iterable, repeat(padding))

    def pad(iterable, size, padding=None):
        return islice(pad_infinite(iterable, padding), size)
    
    colA, colB, colC, colD= list(pad(X.split('-'), 4, ''))

Upvotes: 0

Views: 1431

Answers (1)

werner
werner

Reputation: 14845

You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:

df = ...
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)

Output:

+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+

Upvotes: 2

Related Questions