Zhenya
Zhenya

Reputation: 13

how to concat values of columns in pyspark

I have a dataframe. A column "names" contains columns headers, which values should be concated. I want to do it with pyspark concat_ws() but nothing works. I must solve it with concat_ws() function, no pandas etc.

The best what I got it was concated headers, but not values in this columns. I couldn't return list from function to unpack it in concat_ws()

map_cols = {'a':'newA', 'b':'newB', 'c':'newC', 'd':'newD'}

@udf
def get_result(names_col):
    headers = []
    for i in names_col:
        headers.append(map_cols[i])
    return headers

df = df.withColumn('names_arr', split('names', '_')).withColumn('result', concat_ws(';', get_result(col('names_arr'))))
Input dataframe:

names   | newA|newB|newC|newD
---------------------------
a_b     |1    | 2  | 7  |8
---------------------------
a_b_c   |2    | 3  | 4  |4
---------------------------
a_b_c_d |3    | 2  |4   |4
---------------------------
c_d     | 89  |  5 |3   |5
---------------------------
b_c_d   |  7  |5   |6   | 5


Expected output dataframe

names   | newA|newB|newC|newD|result
--------------------------------------
a_b     |1    | 2  | 7  | 8  |1;2
--------------------------------------
a_b_c   |2    | 3  | 4  |4   |2;3;4
--------------------------------------
a_b_c_d |3    | 2  |4   |4   |2;3;4;4
--------------------------------------
c_d     |89   |  5 |3   |5   |3;5
--------------------------------------
b_c_d   |7    |5   |6   | 5  |5;6;5

Upvotes: 0

Views: 2097

Answers (1)

SMaZ
SMaZ

Reputation: 2655

I am assuming that in your expected output colA is a typo for the last two rows (89 and 7)

You can iterate the dataframe.columns and perform concat_ws

# Skip data prepare

#import 
import pyspark.sql.functions as f

df.show()
+-------+----+----+----+----+
|  names|newA|newB|newC|newD|
+-------+----+----+----+----+
|    a_b|   1|   2|null|null|
|  a_b_c|   2|   3|   4|null|
|a_b_c_d|   3|   2|   4|   4|
|    c_d|null|null|   3|   5|
|  b_c_d|null|   5|   6|   5|
+-------+----+----+----+----+

Filetring column name if its names and the concat by ; separator

df.withColumn('result', f.concat_ws(';', *[c for c in df.columns if c!='names'])).show()
+-------+----+----+----+----+-------+
|  names|newA|newB|newC|newD| result|
+-------+----+----+----+----+-------+
|    a_b|   1|   2|null|null|    1;2|
|  a_b_c|   2|   3|   4|null|  2;3;4|
|a_b_c_d|   3|   2|   4|   4|3;2;4;4|
|    c_d|null|null|   3|   5|    3;5|
|  b_c_d|null|   5|   6|   5|  5;6;5|
+-------+----+----+----+----+-------+

Upvotes: 1

Related Questions