Reputation: 139
I am new to Spark and I have a requirement to generate multiple rows and columns from a single row.
Input:
col1 col2 col3 col4
Output
col1 col2 col3 col4 col5 col6 col7
col1 col2 col3 col4 col8 col9 col10
Logic for the new columns:
**col5:**
if col1==0 and col3!=0:
col5 = col4/col3
else:
col5 = 0
**col6:**
if col1==0 and col4!=0:
col6 = (col3*col4)/col1
else:
col6 = 0
For the first row, col7 holds the same value as col2.
**col8:**
if col1!=0 and col3!=0:
col8 = col4/col3
else:
col8 = 0
**col9:**
if col1!=0 and col4!=0:
col9 = (col3*col4)/col1
else:
col9 = 0
For the second row, col10 = col2 + "_NEW"
At the end, a 'sum' function needs to be applied with a group by. I hope that will be easy once we convert to the above structure.
Most of the articles on Google explain how to add a single column to an existing dataframe using the "withColumn" option, not multiple columns, and none of them cover this scenario. So I would like to kindly ask for your assistance.
Upvotes: 2
Views: 7601
Reputation: 11985
Hope this helps!
from pyspark.sql.functions import col, when, lit, concat, round, sum  # note: round and sum here shadow the Python builtins
# sample data
df = sc.parallelize([(1, 2, 3, 4), (5, 6, 7, 8)]).toDF(["col1", "col2", "col3", "col4"])

# populate col5, col6, col7
col5 = when((col('col1') == 0) & (col('col3') != 0), round(col('col4') / col('col3'), 2)).otherwise(0)
col6 = when((col('col1') == 0) & (col('col4') != 0), round((col('col3') * col('col4')) / col('col1'), 2)).otherwise(0)
col7 = col('col2')

df1 = df.withColumn("col5", col5) \
    .withColumn("col6", col6) \
    .withColumn("col7", col7)
# populate col8, col9, col10 (named col5-col7 so the union lines up with df1)
col8 = when((col('col1') != 0) & (col('col3') != 0), round(col('col4') / col('col3'), 2)).otherwise(0)
col9 = when((col('col1') != 0) & (col('col4') != 0), round((col('col3') * col('col4')) / col('col1'), 2)).otherwise(0)
col10 = concat(col('col2'), lit("_NEW"))

df2 = df.withColumn("col5", col8) \
    .withColumn("col6", col9) \
    .withColumn("col7", col10)
#final dataframe
final_df = df1.union(df2)
final_df.show()
#groupBy calculation
#final_df.groupBy("col1", "col2", "col3", "col4").agg(sum("col5")).show()
Output is:
+----+----+----+----+----+----+-----+
|col1|col2|col3|col4|col5|col6| col7|
+----+----+----+----+----+----+-----+
| 1| 2| 3| 4| 0.0| 0.0| 2|
| 5| 6| 7| 8| 0.0| 0.0| 6|
| 1| 2| 3| 4|1.33|12.0|2_NEW|
| 5| 6| 7| 8|1.14|11.2|6_NEW|
+----+----+----+----+----+----+-----+
Don't forget to let us know if it solved your problem :)
Upvotes: 4
Reputation: 1370
There are a few options:

1. **withColumn** as many times as you need (i.e. however many columns you need to add).
2. **map** over the data frame to parse the columns and return a Row with the proper columns, then create a DataFrame afterwards.

Upvotes: 1