Reputation:
I have the DataFrame below.
id,code
1,GSTR
2,GSTR
3,NA
4,NA
5,NA
Here GSTR may change; it can be any string. I want to replace NA with the other string that is present in the same column.
In this case, that means replacing NA with GSTR. I tried using UDFs, but because the replacement string is not known in advance, I could not figure out how to do it.
Note: the code column will contain only two distinct strings. One is always "NA" and the other can be anything; in our case it is GSTR.
Expected output
1,GSTR
2,GSTR
3,GSTR
4,GSTR
5,GSTR
Upvotes: 0
Views: 711
Reputation: 5870
Since the column holds only one value other than NA, we can fetch that value and write it to every row:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,'GSTR'),(2,'GSTR'),(3,'NA'),(4,'NA'),(5,'NA')],['id','code'])
>>> df.show()
+---+----+
| id|code|
+---+----+
| 1|GSTR|
| 2|GSTR|
| 3| NA|
| 4| NA|
| 5| NA|
+---+----+
>>> rstr = df.where(df.code != 'NA').select('code').first().code
>>> df.withColumn('code',F.lit(rstr)).show()
+---+----+
| id|code|
+---+----+
| 1|GSTR|
| 2|GSTR|
| 3|GSTR|
| 4|GSTR|
| 5|GSTR|
+---+----+
Hope this helps.
Upvotes: 1