Reputation: 99
Suppose I have a list new_id_acc = [6,8,1,2,4] and I have PySpark DataFrame like
id_acc | name |
10 | ABC |
20 | XYZ |
21 | KBC |
34 | RAH |
19 | SPD |
I want to replace the pyspark column id_acc with new_id_acc value how can I achieve and do this. I tried and found that lit() can be used but for a constant value but didn't find anything how to do for list.
After replacement I want my PySpark Dataframe to look like this
id_acc | name |
6 | ABC |
8 | XYZ |
1 | KBC |
2 | RAH |
4 | SPD |
Upvotes: 1
Views: 3505
Reputation: 1698
Probably long answer but it works.
df = spark.sparkContext.parallelize([(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'ABC'),(19,'SPD')]).toDF(('id_acc', 'name'))
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| ABC|
| 19| SPD|
+------+----+
new_id_acc = [6,8,1,2,4]
indx = ['ABC','XYZ','KBC','ABC','SPD']
from pyspark.sql.types import *
myschema= StructType([ StructField("indx", StringType(), True),StructField("new_id_ac", IntegerType(), True)])
df1=spark.createDataFrame(zip(indx,new_id_acc),schema = myschema)
df1.show()
+----+---------+
|indx|new_id_ac|
+----+---------+
| ABC| 6|
| XYZ| 8|
| KBC| 1|
| ABC| 2|
| SPD| 4|
+----+---------+
dfnew = df.join(df1, df.name == df1.indx,how='left').drop(df1.indx).select('new_id_ac','name').sort('name').dropDuplicates(['new_id_ac'])
dfnew.show()
+---------+----+
|new_id_ac|name|
+---------+----+
| 1| KBC|
| 6| ABC|
| 4| SPD|
| 8| XYZ|
| 2| ABC|
+---------+----+
Upvotes: 1
Reputation: 7585
The idea is to create a column of consecutive serial/row numbers
and then use them to get the corresponding values from the list.
# Creating the requisite DataFrame
from pyspark.sql.functions import row_number,lit, udf
from pyspark.sql.window import Window
valuesCol = [(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'RAH'),(19,'SPD')]
df = spark.createDataFrame(valuesCol,['id_acc','name'])
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| RAH|
| 19| SPD|
+------+----+
You can create row/serial numbers like done here.
Note that A
below is just a dummy
value, as we don't need to order tha values. We just want the row number
.
w = Window().orderBy(lit('A'))
df = df.withColumn('serial_number', row_number().over(w))
df.show()
+------+----+-------------+
|id_acc|name|serial_number|
+------+----+-------------+
| 10| ABC| 1|
| 20| XYZ| 2|
| 21| KBC| 3|
| 34| RAH| 4|
| 19| SPD| 5|
+------+----+-------------+
As a final step, we will access the elements from the list provided by the OP
using the row number
. For this we use udf
.
new_id_acc = [6,8,1,2,4]
mapping = udf(lambda x: new_id_acc[x-1])
df = df.withColumn('id_acc', mapping(df.serial_number)).drop('serial_number')
df.show()
+------+----+
|id_acc|name|
+------+----+
| 6| ABC|
| 8| XYZ|
| 1| KBC|
| 2| RAH|
| 4| SPD|
+------+----+
Upvotes: 1