Reputation: 747
I have a PySpark DataFrame with four columns (C1, C2, C3 and C4). The third column (C3) holds categorical values such as V1, V2 and V3, and the fourth column (C4) holds the corresponding numeric values. I am trying to add new columns V1, V2 and V3 whose values come from the corresponding rows of the fourth column (C4).
I am able to transpose rows to columns through a UDF and DF.withColumn, but I am unable to bring the values across:
def valTocat(C3):
    if C3 == 'xyz':
        return 1
    else:
        return 0
But the following is not working:
def valTocat((C3, C4)):
    if C3 == 'xyz':
        return C4
    else:
        return 0
Somehow I am unable to post the data in tabular format, but I think it is easy to visualize.
Any suggestions would be really appreciated.
Upvotes: 1
Views: 873
Reputation: 24198
You can try to pivot() your DataFrame:
from pyspark.sql.functions import expr
df.groupBy("c1", "c2") \
    .pivot("c3") \
    .agg(expr("coalesce(first(c4))")).show()
You need the function coalesce to substitute the missing values with a null.
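To make the reshaping concrete, here is a plain-Python sketch of what a group-by, pivot and "first value" aggregation computes. It is only an illustration, not Spark code: the sample rows and the helper name pivot_first are made up for this example. With a single argument, coalesce returns its argument unchanged, so combinations with no matching row stay null; if you want 0 there instead (as in the question's UDF), a second argument such as coalesce(first(c4), 0) should do it, which the default parameter below mimics.

```python
from collections import defaultdict

# Hypothetical sample rows in the shape described in the question: (c1, c2, c3, c4)
rows = [
    ("a", "x", "V1", 10),
    ("a", "x", "V2", 20),
    ("b", "y", "V3", 30),
    ("b", "y", "V1", 40),
]

def pivot_first(rows, default=None):
    """Group by (c1, c2), turn each distinct c3 into a column, keep the first c4 seen."""
    categories = sorted({r[2] for r in rows})   # the new column names: V1, V2, V3
    groups = defaultdict(dict)
    for c1, c2, c3, c4 in rows:
        groups[(c1, c2)].setdefault(c3, c4)     # "first": do not overwrite an existing value
    return [
        {"c1": c1, "c2": c2, **{cat: vals.get(cat, default) for cat in categories}}
        for (c1, c2), vals in groups.items()
    ]

# default=None mirrors the nulls left by the pivot; default=0 mirrors a 0 fallback
for row in pivot_first(rows, default=0):
    print(row)
```

Each output row has one entry per category, filled from c4 where a matching row exists and from the default otherwise.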
Upvotes: 2