silviacamplani
silviacamplani

Reputation: 361

How to remove 'duplicate' rows from joining the same pyspark dataframe?

I need to show a dataframe made by three columns. Two of them show the names of someone who worked in a common movie (indicated by code on the third code)

here's my code for the query:

name_data_df3 = movie_join_oscar_join_name.filter('Nazione in ("USA", "England", "France", "Canada", "Italy", "Austria")').select('knownForTitle','person').distinct()
art_pre_edge_data1 = name_data_df3.select(col('person').alias('artist1'), 'knownForTitle', )
art_pre_edge_data2 = art_pre_edge_data1.select(col('artista').alias('artist2'), col('knownForTitle').alias('knownForTitle2'))
art_edge_data = art_pre_edge_data1.join(art_pre_edge_data2, (art_pre_edge_data1.knownForTitle==art_pre_edge_data2.knownForTitle2) & (art_pre_edge_data1.artista!=art_pre_edge_data2.artista2))
art_edge_data = art_edge_data.drop(art_pre_edge_data2.knownForTitle2)
art_edge_data.select('*').show()

and here is the result:

+-----------------+-------------+-----------------+
|          artist1|knownForTitle|         artist2|
+-----------------+-------------+-----------------+
|   Robert Redford|    tt0070735|  George Roy Hill|
|  George Roy Hill|    tt0070735|   Robert Redford|
| Lionel Barrymore|    tt0022958|    Wallace Beery|
| Lionel Barrymore|    tt0022958|    Joan Crawford|
|    Wallace Beery|    tt0022958| Lionel Barrymore|
|    Wallace Beery|    tt0022958|    Joan Crawford|
|    Joan Crawford|    tt0022958| Lionel Barrymore|
|    Joan Crawford|    tt0022958|    Wallace Beery|

how can I remove the duplicates?

Upvotes: 3

Views: 8917

Answers (2)

pault
pault

Reputation: 43494

Here's a way to do it using DataFrame functions. Compare the two columns alphabetically and assign values such that artist1 will always sort lexicographically before artist2. Then select the distinct rows.

import pyspark.sql.functions as f

df.select(
    'knownForTitle',
    f.when(f.col('artist1') < f.col('artist2'), f.col('artist1')).otherwise(f.col('artist2')).alias('artist1'),
    f.when(f.col('artist1') < f.col('artist2'), f.col('artist2')).otherwise(f.col('artist1')).alias('artist2'),
).distinct().show()
#+-------------+----------------+----------------+
#|knownForTitle|         artist1|         artist2|
#+-------------+----------------+----------------+
#|    tt0070735| George Roy Hill|  Robert Redford|
#|    tt0022958|   Joan Crawford|Lionel Barrymore|
#|    tt0022958|   Joan Crawford|   Wallace Beery|
#|    tt0022958|Lionel Barrymore|   Wallace Beery|
#+-------------+----------------+----------------+

Upvotes: 1

pauli
pauli

Reputation: 4291

You can create a udf and do something like this

func_udf = f.udf(lambda x,y: sorted([x,y]), t.ArrayType(t.StringType()))

df2 = df.withColumn('artists', func_udf('artist1','artist2')).select(['artists',
                             'KnownForTitle']).distinct().show()

Upvotes: 0

Related Questions