Reputation: 31
Using PySpark 2.2.
I have a Spark DataFrame with multiple columns. I need to pass two columns into a UDF and return a third column.
Input:
+-----+------+
|col_A| col_B|
+-----+------+
| abc|abcdef|
| abc| a|
+-----+------+
Both col_A and col_B are StringType().
Desired output:
+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
| abc|abcdef| abc|
| abc| a| a|
+-----+------+-------+
I want new_col to be a substring of col_A with the length of col_B.
I tried:
udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()
But it gives TypeError: Column is not iterable.
Any idea how to do this kind of manipulation?
Upvotes: 3
Views: 5313
Reputation: 43504
There are two major things wrong here:
1. You defined your udf to take in one input parameter when it should take two.
2. You can't use the DataFrame API functions within a udf. (Calling the udf serializes the data to Python, so you need to use Python syntax and functions.)
Here's a proper udf implementation for this problem:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_substring(a, b):
    # You should add in your own error checking
    return a[:len(b)]

udf_substring = F.udf(lambda x, y: my_substring(x, y), StringType())
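(The lambda wrapper is redundant here, by the way: since my_substring already takes two arguments, F.udf(my_substring, StringType()) works just as well.)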
And then call it by passing in the two columns as arguments:
df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))
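To sanity-check it against the sample data above, here's a minimal sketch, assuming an active SparkSession named spark:
df = spark.createDataFrame([('abc', 'abcdef'), ('abc', 'a')], ['col_A', 'col_B'])
df.withColumn('new_col', udf_substring(F.col('col_A'), F.col('col_B'))).show()
# +-----+------+-------+
# |col_A| col_B|new_col|
# +-----+------+-------+
# |  abc|abcdef|    abc|
# |  abc|     a|      a|
# +-----+------+-------+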
However, in this case you can do this without a udf, using the method described in this post:
df.withColumn(
'new_col',
F.expr("substring(col_A,0,length(col_B))")
)
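Another udf-free option is Column.substr, which (unlike F.substring) accepts Column arguments for the start position and length, so you can express the same thing without dropping into a SQL string. Note that substr is 1-based:
df.withColumn('new_col', F.col('col_A').substr(F.lit(1), F.length('col_B')))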
Upvotes: 3