Prabhanj
Prabhanj

Reputation: 282

Pyspark concat column based on other columns values

I am working with pyspark below is my dataframe

col_list   | col_1| col_2 |...|col_n
col_1,col2 | x    | y     |...|n
col_2,col_n| a    | b     |...|z

I want concat columns which are present in my col_list column values

Expected DF:

col_list   | col_1| col_2 |...|col_n | concat_cols
col_1,col2 | x    | y     |...|n     | [col_1:x,col_2:y]
col_2,col_n| a    | b     |...|z     | [col_2:b,col_n:z]

Trying with below approaches:

  1. Creating udf to pass col_list and create expression to create array col like array(df['col_1'],df['col_2']) , but this will create column with string , not sure how to execute expression

  2. Using create map passing col_list in loop and create column, need to check on this.

Upvotes: 0

Views: 369

Answers (1)

mck
mck

Reputation: 42342

You can use some map methods:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_map', 
    F.map_from_arrays(
        F.array([F.lit(c) for c in df.columns[1:]]), 
        F.array(df.columns[1:])
    )
).withColumn(
    'concat_cols', 
    F.expr("""
        map_from_arrays(
            split(col_list, ','), 
            transform(split(col_list, ','), x -> col_map[x])
        )
    """)
).drop('col_map')

df2.show(truncate=False)
+-----------+-----+-----+-----+------------------------+
|col_list   |col_1|col_2|col_n|concat_cols             |
+-----------+-----+-----+-----+------------------------+
|col_1,col_2|x    |y    |n    |[col_1 -> x, col_2 -> y]|
|col_2,col_n|a    |b    |z    |[col_2 -> b, col_n -> z]|
+-----------+-----+-----+-----+------------------------+

If you prefer a string representation rather than a map type column, you can do

import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_map', 
    F.map_from_arrays(
        F.array([F.lit(c) for c in df.columns[1:]]), 
        F.array(df.columns[1:])
    )
).withColumn(
    'concat_cols', 
    F.expr("""
        concat_ws(',', 
            transform(split(col_list, ','), x -> concat_ws(':', x, col_map[x]))
        )
    """)
).drop('col_map')

df2.show(truncate=False)
+-----------+-----+-----+-----+---------------+
|col_list   |col_1|col_2|col_n|concat_cols    |
+-----------+-----+-----+-----+---------------+
|col_1,col_2|x    |y    |n    |col_1:x,col_2:y|
|col_2,col_n|a    |b    |z    |col_2:b,col_n:z|
+-----------+-----+-----+-----+---------------+

Upvotes: 2

Related Questions