Reputation: 43
I have a PySpark source dataframe with a column named X. Column X consists of '-'-delimited values, and a row can contain any number of them. An example of the source dataframe is given below:
X |
---|
A123-B345-C44656-D4423-E3445-F5667 |
X123-Y345 |
Z123-N345-T44656-M4423 |
X123 |
Now I need to split this column on the delimiter and pull out exactly N=4 separate values. If there are more than 4 delimited values, I need the first 4 and discard the rest. If there are fewer than 4, I need to keep the existing ones and pad the rest with the empty string "".
The resulting output should look like below:
X | Col1 | Col2 | Col3 | Col4 |
---|---|---|---|---|
A123-B345-C44656-D4423-E3445-F5667 | A123 | B345 | C44656 | D4423 |
X123-Y345 | X123 | Y345 | | |
Z123-N345-T44656-M4423 | Z123 | N345 | T44656 | M4423 |
X123 | X123 | | | |
I've accomplished this easily in Python with the code below, but I'm looking for a PySpark approach:
from itertools import chain, islice, repeat

def pad_infinite(iterable, padding=None):
    # yield the iterable's items, then the padding value forever
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    # truncate or pad the iterable to exactly `size` items
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
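For example, applied to the two-value row from the sample data, pad fills the missing slots:

colA, colB, colC, colD = list(pad('X123-Y345'.split('-'), 4, ''))
# colA='X123', colB='Y345', colC='', colD=''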
Upvotes: 0
Views: 1431
Reputation: 14845
You can split the string into an array, separate the elements of the array into columns, and then fill the null values with an empty string:
import pyspark.sql.functions as F

df = ...
# out-of-range array indexes yield null, which na.fill replaces with ""
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
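If the number of columns shouldn't be hard-coded at 4, the same idea can be parameterized. A minimal sketch, assuming a variable n and the Col{i} naming convention (neither is part of the question):

import pyspark.sql.functions as F

n = 4  # assumed: desired number of output columns
# getItem(i) returns null past the end of the array; coalesce swaps in ""
cols = [F.coalesce(F.split("X", "-").getItem(i), F.lit("")).alias(f"Col{i + 1}")
        for i in range(n)]
df.select("X", *cols).show(truncate=False)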
Upvotes: 2