Reputation: 43
I have a PySpark source dataframe with a column named X. Column X consists of '-'-delimited values, and a row can contain any number of them. An example of the source dataframe is given below:
X |
---|
A123-B345-C44656-D4423-E3445-F5667 |
X123-Y345 |
Z123-N345-T44656-M4423 |
X123 |
Now I need to split this column on the delimiter and pull out exactly N=4 separate values. If there are more than 4 delimited values, I need the first 4 and discard the rest. If there are fewer than 4, I need to keep the existing ones and pad the rest with the empty string "".
The resulting output should look like below:
X | Col1 | Col2 | Col3 | Col4 |
---|---|---|---|---|
A123-B345-C44656-D4423-E3445-F5667 | A123 | B345 | C44656 | D4423 |
X123-Y345 | X123 | Y345 | | |
Z123-N345-T44656-M4423 | Z123 | N345 | T44656 | M4423 |
X123 | X123 | | | |
I've accomplished this easily in Python with the code below, but I'm looking for a PySpark approach:
from itertools import chain, islice, repeat

def pad_infinite(iterable, padding=None):
    # yield the iterable's items, then the padding value forever
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    # truncate or pad the iterable to exactly `size` items
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
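For example, applied to the two-value row from the sample data, pad fills the missing slots:

colA, colB, colC, colD = list(pad('X123-Y345'.split('-'), 4, ''))
# colA='X123', colB='Y345', colC='', colD=''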
Upvotes: 0
Views: 1431
Reputation: 14845
You can split the string into an array, separate the elements of the array into columns, and then fill the null values with an empty string:
import pyspark.sql.functions as F

df = ...
# out-of-range array indexes yield null, which na.fill replaces with ""
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
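If the number of columns shouldn't be hard-coded at 4, the same idea can be parameterized. A minimal sketch, assuming a variable n and the Col{i} naming convention (neither is part of the question):

import pyspark.sql.functions as F

n = 4  # assumed: desired number of output columns
# getItem(i) returns null past the end of the array; coalesce swaps in ""
cols = [F.coalesce(F.split("X", "-").getItem(i), F.lit("")).alias(f"Col{i + 1}")
        for i in range(n)]
df.select("X", *cols).show(truncate=False)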
Upvotes: 2