Reputation: 139
I am new to Spark and I have a requirement to generate multiple rows and columns from a single row.
Input:
col1 col2 col3 col4
Output
col1 col2 col3 col4 col5 col6 col7
col1 col2 col3 col4 col8 col9 col10
Logic for the new columns:
**col5:**
if col1==0 and col3!=0:
col5 = col4/col3
else:
col5 = 0
**col6:**
if col1==0 and col4!=0:
col6 = (col3*col4)/col1
else:
col6 = 0
For the first row, col7 holds the same value as col2.
**col8:**
if col1!=0 and col3!=0:
col8 = col4/col3
else:
col8 = 0
**col9:**
if col1!=0 and col4!=0:
col9 = (col3*col4)/col1
else:
col9 = 0
For the second row, col10 = col2 + "_NEW"
At the end, a 'sum' function needs to be applied with a group by. I hope that will be easy once we convert to the above structure.
Most of the articles on Google explain how to add a single column to an existing dataframe using the "withColumn" option, not multiple columns, and none of them cover this scenario. So I would like to kindly ask for your assistance.
Upvotes: 2
Views: 7601
Reputation: 11985
Hope this helps!
from pyspark.sql.functions import col, when, lit, concat, round, sum  # note: round and sum here shadow the Python builtins
# sample data
df = sc.parallelize([(1, 2, 3, 4), (5, 6, 7, 8)]).toDF(["col1", "col2", "col3", "col4"])

# populate col5, col6, col7
col5 = when((col('col1') == 0) & (col('col3') != 0), round(col('col4') / col('col3'), 2)).otherwise(0)
col6 = when((col('col1') == 0) & (col('col4') != 0), round((col('col3') * col('col4')) / col('col1'), 2)).otherwise(0)
col7 = col('col2')

df1 = df.withColumn("col5", col5) \
    .withColumn("col6", col6) \
    .withColumn("col7", col7)
# populate col8, col9, col10 (named col5-col7 so the union lines up with df1)
col8 = when((col('col1') != 0) & (col('col3') != 0), round(col('col4') / col('col3'), 2)).otherwise(0)
col9 = when((col('col1') != 0) & (col('col4') != 0), round((col('col3') * col('col4')) / col('col1'), 2)).otherwise(0)
col10 = concat(col('col2'), lit("_NEW"))

df2 = df.withColumn("col5", col8) \
    .withColumn("col6", col9) \
    .withColumn("col7", col10)
#final dataframe
final_df = df1.union(df2)
final_df.show()
#groupBy calculation
#final_df.groupBy("col1", "col2", "col3", "col4").agg(sum("col5")).show()
Output is:
+----+----+----+----+----+----+-----+
|col1|col2|col3|col4|col5|col6| col7|
+----+----+----+----+----+----+-----+
| 1| 2| 3| 4| 0.0| 0.0| 2|
| 5| 6| 7| 8| 0.0| 0.0| 6|
| 1| 2| 3| 4|1.33|12.0|2_NEW|
| 5| 6| 7| 8|1.14|11.2|6_NEW|
+----+----+----+----+----+----+-----+
Don't forget to let us know if it solved your problem :)
Upvotes: 4
Reputation: 1370
There are a few options:

1. **withColumn** as many times as you need (i.e. however many columns you need to add).
2. **map** over the data frame to parse the columns and return a Row with the proper columns, then create a DataFrame afterwards.

Upvotes: 1