Reputation: 150
I have a dataset that I'd like to work with in PySpark. One DF has the form:
+--------+----------------------------+
|artistid|artist_name |
+--------+----------------------------+
|1134999 |06Crazy Life |
|6821360 |Pang Nakarin |
|10113088|Terfel, Bartoli- Mozart: Don|
|10151459|The Flaming Sidebur |
|6826647 |Bodenstandig 3000 |
+--------+----------------------------+
And, as it looks like in the above preview, some artist's names are bad spelled, and so there are duplicated artists but with different IDs.
I have another another DF that looks like this:
+--------+-------+
|badid |goodid |
+--------+-------+
|1092764 |1000311|
|1095122 |1000557|
|6708070 |1007267|
|10088054|1042317|
|1195917 |1042317|
+--------+-------+
That is, it maps each "bad id" from the previous DF into one single "good id".
Now what I would like to do: Replace every bad id in the first DF to the correspondent good id in the second DF. I'm guessing I would have to do a map of some sort, but I have never worked with .map in PySpark before.
How can I map a value from one DF, to a value in another DF, in a case like this?
Upvotes: 0
Views: 289
Reputation: 6082
How about 2 left joins to figure which good id is mapped to which bad id?
df1
df1 = spark.createDataFrame([
(1134999, '06Crazy Life'),
(6821360, 'Pang Nakarin'),
(10113088, 'Terfel, Bartoli- Mozart: Don'),
(10151459, 'The Flaming Sidebur'),
(6826647, 'Bodenstandig 3000'),
(11223344, 'Crazy Life'),
], ['artistid', 'artist_name'])
df2
df2 = spark.createDataFrame([
(1092764, 1000311),
(1095122, 1000557),
(6708070, 1007267),
(10088054, 1042317),
(1195917, 1042317),
(1134999, 11223344),
], ['badid', 'goodid'])
(df1
.alias('a1')
.join(df2.alias('id'), on=[F.col('a1.artistid') == F.col('id.badid')], how='left')
.join(df1.alias('a2'), on=[F.col('a2.artistid') == F.col('id.goodid')], how='left')
.show()
)
+--------+--------------------+-------+--------+--------+-----------+
|artistid| artist_name| badid| goodid|artistid|artist_name|
+--------+--------------------+-------+--------+--------+-----------+
| 6821360| Pang Nakarin| null| null| null| null|
| 6826647| Bodenstandig 3000| null| null| null| null|
|10113088|Terfel, Bartoli- ...| null| null| null| null|
|10151459| The Flaming Sidebur| null| null| null| null|
|11223344| Crazy Life| null| null| null| null|
| 1134999| 06Crazy Life|1134999|11223344|11223344| Crazy Life|
+--------+--------------------+-------+--------+--------+-----------+
Upvotes: 1