Reputation: 77
I've created a dataframe in which I paired two IDs with each other. E.g.:
|First ID|Second ID|Concatenated column|
|--------|---------|-------------------|
|1111111111|2222222222|1111111111,2222222222|
|1111111111|1111111112|1111111111,1111111112|
|2222222222|1111111111|2222222222,1111111111|
What I want to do is remove all duplicate pairs, regardless of which ID is in the first column and which is in the second.
What I figured would be the best way to do this is to create a concatenated column that contains both IDs in the same order for each row/pair, i.e. ascending.
Is this indeed the best way to remove all duplicate pairs (regardless of which ID comes first)? And if so, how can I concatenate the columns in a specified order (ascending/descending)?
So far, I've only found a way to concatenate the two columns, using this code:
import pyspark.sql.functions as sf

df = df.withColumn('joined_IDs',
                   sf.concat(sf.col('Id_1'), sf.lit(', '), sf.col('Id_2')))
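For the ordered concatenation I have in mind, something like this might work (a sketch, assuming both ID columns are strings of equal length, so that lexicographic order matches numeric order):

import pyspark.sql.functions as sf

# Put the smaller ID first and the larger one second before concatenating,
# so both orderings of the same pair produce an identical key.
# Note: least/greatest compare strings lexicographically, which only
# matches numeric order here because all the IDs have the same length.
df = df.withColumn(
    'joined_IDs',
    sf.concat(sf.least('Id_1', 'Id_2'), sf.lit(', '), sf.greatest('Id_1', 'Id_2'))
)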
If there's any more sophisticated way to remove pairs, that would also be great to know!
Thanks!
Upvotes: 0
Views: 564
Reputation: 42352
Use array_sort and array. If you want to remove duplicates, use dropDuplicates.
df.show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
|222|111|
+---+---+
import pyspark.sql.functions as F
df2 = df.withColumn('ids', F.array_sort(F.array('id1', 'id2')))
df2.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
|222|111|[111, 222]|
+---+---+----------+
df3 = df2.dropDuplicates(['ids'])
df3.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
+---+---+----------+
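If you don't need the helper column afterwards, you can drop it (a small follow-up, assuming only the original columns should remain):

df3.drop('ids').show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
+---+---+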
Upvotes: 2