Reputation: 361
I need to show a dataframe made by three columns. Two of them show the names of someone who worked in a common movie (indicated by code on the third code)
here's my code for the query:
name_data_df3 = movie_join_oscar_join_name.filter('Nazione in ("USA", "England", "France", "Canada", "Italy", "Austria")').select('knownForTitle','person').distinct()
art_pre_edge_data1 = name_data_df3.select(col('person').alias('artist1'), 'knownForTitle', )
art_pre_edge_data2 = art_pre_edge_data1.select(col('artista').alias('artist2'), col('knownForTitle').alias('knownForTitle2'))
art_edge_data = art_pre_edge_data1.join(art_pre_edge_data2, (art_pre_edge_data1.knownForTitle==art_pre_edge_data2.knownForTitle2) & (art_pre_edge_data1.artista!=art_pre_edge_data2.artista2))
art_edge_data = art_edge_data.drop(art_pre_edge_data2.knownForTitle2)
art_edge_data.select('*').show()
and here is the result:
+-----------------+-------------+-----------------+
| artist1|knownForTitle| artist2|
+-----------------+-------------+-----------------+
| Robert Redford| tt0070735| George Roy Hill|
| George Roy Hill| tt0070735| Robert Redford|
| Lionel Barrymore| tt0022958| Wallace Beery|
| Lionel Barrymore| tt0022958| Joan Crawford|
| Wallace Beery| tt0022958| Lionel Barrymore|
| Wallace Beery| tt0022958| Joan Crawford|
| Joan Crawford| tt0022958| Lionel Barrymore|
| Joan Crawford| tt0022958| Wallace Beery|
how can I remove the duplicates?
Upvotes: 3
Views: 8917
Reputation: 43494
Here's a way to do it using DataFrame functions. Compare the two columns alphabetically and assign values such that artist1
will always sort lexicographically before artist2
. Then select the distinct rows.
import pyspark.sql.functions as f
df.select(
'knownForTitle',
f.when(f.col('artist1') < f.col('artist2'), f.col('artist1')).otherwise(f.col('artist2')).alias('artist1'),
f.when(f.col('artist1') < f.col('artist2'), f.col('artist2')).otherwise(f.col('artist1')).alias('artist2'),
).distinct().show()
#+-------------+----------------+----------------+
#|knownForTitle| artist1| artist2|
#+-------------+----------------+----------------+
#| tt0070735| George Roy Hill| Robert Redford|
#| tt0022958| Joan Crawford|Lionel Barrymore|
#| tt0022958| Joan Crawford| Wallace Beery|
#| tt0022958|Lionel Barrymore| Wallace Beery|
#+-------------+----------------+----------------+
Upvotes: 1
Reputation: 4291
You can create a udf and do something like this
func_udf = f.udf(lambda x,y: sorted([x,y]), t.ArrayType(t.StringType()))
df2 = df.withColumn('artists', func_udf('artist1','artist2')).select(['artists',
'KnownForTitle']).distinct().show()
Upvotes: 0