Antonius

Reputation: 77

Concatenate columns in dataframe and order them based on number

I've created a dataframe in which I paired two IDs with each other. E.g.

|First ID|Second ID|Concatenated column|
|--------|---------|-------------------|
|1111111111|2222222222|1111111111,2222222222|
|1111111111|1111111112|1111111111,1111111112|
|2222222222|1111111111|2222222222,1111111111|

What I want to do is remove all duplicate pairs, regardless of which of the two IDs comes first and which comes second.

What I figured as the best way to do this is to create a concatenated column that holds both IDs, using the same order for each row/pair, i.e. ascending.

Is this indeed the best way to remove all duplicate pairs (regardless of which ID comes first)? And if so, how can I concatenate the columns in a specified order (ascending/descending)?

So far, I've only found a way to concatenate the two columns, using this code:

import pyspark.sql.functions as sf

# join the two ID columns into a single comma-separated string
df = df.withColumn('joined_IDs',
                   sf.concat(sf.col('Id_1'), sf.lit(', '), sf.col('Id_2')))
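
I suppose something like least and greatest could force the ascending order (a sketch, assuming the IDs are of a comparable type), but I'm not sure it's the cleanest approach:

# put the smaller ID first and the larger one second, so every
# pair is concatenated in the same (ascending) order
df = df.withColumn('joined_IDs',
                   sf.concat(sf.least('Id_1', 'Id_2'),
                             sf.lit(', '),
                             sf.greatest('Id_1', 'Id_2')))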

If there's any more sophisticated way to remove pairs, that would also be great to know!

Thanks!

Upvotes: 0

Views: 564

Answers (1)

mck

Reputation: 42352

Use array to pair the two IDs and array_sort to put each pair into a consistent order; then the order-independent duplicates can be removed with dropDuplicates.

df.show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
|222|111|
+---+---+

import pyspark.sql.functions as F
df2 = df.withColumn('ids', F.array_sort(F.array('id1', 'id2')))
df2.show()
+---+---+----------+
|id1|id2|       ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
|222|111|[111, 222]|
+---+---+----------+

df3 = df2.dropDuplicates(['ids'])
df3.show()
+---+---+----------+
|id1|id2|       ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
+---+---+----------+
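
If you need the sorted pair back as a single comma-separated string like the joined_IDs column in the question, concat_ws can join the sorted array (assuming the IDs are strings), which should give:

# join the sorted array into one string and drop the helper column
df4 = df3.withColumn('joined_IDs', F.concat_ws(',', 'ids')).drop('ids')
df4.show()
+---+---+----------+
|id1|id2|joined_IDs|
+---+---+----------+
|111|222|   111,222|
|111|112|   111,112|
+---+---+----------+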

Upvotes: 2
