Reputation: 5480
I have a data frame like the one below in PySpark
data = [
(1, 'a', '', 'b', '', 'c', '123_abc', 'sam', 'NY'),
(2, 'b', 'abc_123', 'd', '', 'e', '', 'Tim', 'NJ'),
(3, 'c', '', 'f', '', 'g', '', 'Jim', 'SFO')]
df = sc.parallelize(data).toDF(["id", "abc_abled", "abc_serial", "bca_abled", "bca_serial", "cca_abled", "cca_serial", "name", "city"])
df
DataFrame[id: int, abc_abled: string, abc_serial: string, bca_abled: string, bca_serial: string, cca_abled: string, cca_serial: string, name: string, city: string]
df.show()
+---+---------+----------+---------+----------+---------+----------+----+----+
| id|abc_abled|abc_serial|bca_abled|bca_serial|cca_abled|cca_serial|name|city|
+---+---------+----------+---------+----------+---------+----------+----+----+
|  1|        a|          |        b|          |        c|   123_abc| sam|  NY|
|  2|        b|   abc_123|        d|          |        e|          | Tim|  NJ|
|  3|        c|          |        f|          |        g|          | Jim| SFO|
+---+---------+----------+---------+----------+---------+----------+----+----+
I want to create a new data frame by selecting the following columns and concatenating certain column values:
df1
DataFrame[id: int, serial_number: string, name: string, city: string]
df1.show()
+---+-------------+----+----+
| id|serial_number|name|city|
+---+-------------+----+----+
|  1|      123_abc| sam|  NY|
|  2|      abc_123| Tim|  NJ|
|  3|             | Jim| SFO|
+---+-------------+----+----+
Here serial_number is the concatenation of all the columns that end with _serial. If a column's value is empty or null, it should be ignored and the remaining strings concatenated.
How can I achieve that?
Upvotes: 1
Views: 3453
Reputation: 41957
All you have to do is get the list of column names that end with _serial:
serialCols = [x for x in df.columns if x.endswith('_serial')]
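# For the schema above this yields ['abc_serial', 'bca_serial', 'cca_serial']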
Then use it with the built-in concat_ws function to concatenate the column values in a select expression:
from pyspark.sql import functions as f
df.select(
df['id'],
    f.concat_ws('', *serialCols).alias('serial_number'),  # concat_ws skips null inputs
df['name'],
df['city']
).show(truncate=False)
Here I have used an empty string as the separator to concatenate the values, so the above code should give you
+---+-------------+----+----+
|id |serial_number|name|city|
+---+-------------+----+----+
|1  |123_abc      |sam |NY  |
|2  |abc_123      |Tim |NJ  |
|3  |             |Jim |SFO |
+---+-------------+----+----+
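If you'd rather not hardcode the remaining columns either, here is a minimal sketch that builds the select list from the schema instead (assuming, as in your desired output, that the *_abled columns are dropped as well):
# Keep every column that is neither *_serial nor *_abled
# (assumption: the *_abled columns are dropped too, per the desired output)
keep = [c for c in df.columns if not c.endswith(('_serial', '_abled'))]
df.select(
    keep[0],                                              # id
    f.concat_ws('', *serialCols).alias('serial_number'),
    *keep[1:]                                             # name, city
).show(truncate=False)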
Edit: Since the missing values here are empty strings rather than nulls, one could also use pyspark.sql.functions.concat() in place of concat_ws(). Note, however, that concat() returns null as soon as any of its inputs is null, whereas concat_ws() simply skips null inputs.
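A minimal sketch of that difference, assuming an active SparkSession named spark (the frame and column names here are made up for illustration):
from pyspark.sql import functions as f
# One row with a real null in s2; the schema is given explicitly because
# the type of an all-null column cannot be inferred
tmp = spark.createDataFrame([('123_abc', None)], 's1 string, s2 string')
tmp.select(
    f.concat_ws('', 's1', 's2').alias('with_concat_ws'),  # '123_abc' (null skipped)
    f.concat('s1', 's2').alias('with_concat')             # null (null propagates)
).show()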
Upvotes: 2