leo_val
leo_val

Reputation: 150

Mapping different ids in PySpark

I have a dataset that I'd like to work with in PySpark. One DF has the form:

+--------+----------------------------+
|artistid|artist_name                 |
+--------+----------------------------+
|1134999 |06Crazy Life                |
|6821360 |Pang Nakarin                |
|10113088|Terfel, Bartoli- Mozart: Don|
|10151459|The Flaming Sidebur         |
|6826647 |Bodenstandig 3000           |
+--------+----------------------------+

And, as it looks like in the above preview, some artist's names are bad spelled, and so there are duplicated artists but with different IDs.

I have another another DF that looks like this:

+--------+-------+
|badid   |goodid |
+--------+-------+
|1092764 |1000311|
|1095122 |1000557|
|6708070 |1007267|
|10088054|1042317|
|1195917 |1042317|
+--------+-------+

That is, it maps each "bad id" from the previous DF into one single "good id".

Now what I would like to do: Replace every bad id in the first DF to the correspondent good id in the second DF. I'm guessing I would have to do a map of some sort, but I have never worked with .map in PySpark before.

How can I map a value from one DF, to a value in another DF, in a case like this?

Upvotes: 0

Views: 289

Answers (1)

pltc
pltc

Reputation: 6082

How about 2 left joins to figure which good id is mapped to which bad id?

df1
df1 = spark.createDataFrame([
    (1134999, '06Crazy Life'),
    (6821360, 'Pang Nakarin'),
    (10113088, 'Terfel, Bartoli- Mozart: Don'),
    (10151459, 'The Flaming Sidebur'),
    (6826647, 'Bodenstandig 3000'),
    (11223344, 'Crazy Life'),
], ['artistid', 'artist_name'])
df2
df2 = spark.createDataFrame([
    (1092764, 1000311),
    (1095122, 1000557),
    (6708070, 1007267),
    (10088054, 1042317),
    (1195917, 1042317),
    (1134999, 11223344),
], ['badid', 'goodid'])
figure the good artist name
(df1
    .alias('a1')
    .join(df2.alias('id'), on=[F.col('a1.artistid') == F.col('id.badid')], how='left')
    .join(df1.alias('a2'), on=[F.col('a2.artistid') == F.col('id.goodid')], how='left')
    .show()
)

+--------+--------------------+-------+--------+--------+-----------+
|artistid|         artist_name|  badid|  goodid|artistid|artist_name|
+--------+--------------------+-------+--------+--------+-----------+
| 6821360|        Pang Nakarin|   null|    null|    null|       null|
| 6826647|   Bodenstandig 3000|   null|    null|    null|       null|
|10113088|Terfel, Bartoli- ...|   null|    null|    null|       null|
|10151459| The Flaming Sidebur|   null|    null|    null|       null|
|11223344|          Crazy Life|   null|    null|    null|       null|
| 1134999|        06Crazy Life|1134999|11223344|11223344| Crazy Life|
+--------+--------------------+-------+--------+--------+-----------+

Upvotes: 1

Related Questions