Reputation: 105
I'm new to PySpark.
I want to do some column transformations.
My dataframe:
import pandas as pd
df = pd.DataFrame([[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]], columns=list('ABC'))
df:
    A  B   C
0  10  8   9
1   3  5   4
2   1  3   9
3   1  5   3
4   2  8  10
5   8  7   9
In df, each row is one triangle of a triangulation, and the columns 'A', 'B', 'C' are that triangle's vertex indices.
I want to get a dataframe of all the triangles' edges, under one condition: [8, 9] and [9, 8] count as the same edge, and only [8, 9] should remain (always the lesser vertex index first). My desired dataframe edge_df:
1 3
1 5
1 9
2 8
2 10
3 4
3 5
3 9
4 5
7 8
7 9
8 9
8 10
9 10
What I tried: select the six ordered column pairs 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', union them into one two-column dataframe, call distinct(), and then drop the rows where the lesser vertex index ends up in the right column.
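Roughly, a sketch of that approach (not my exact code; it assumes the pandas df has already been converted to a Spark DataFrame sdf):
from pyspark.sql import functions as f

# stack every ordered pair of vertex columns into one two-column frame
pairs = [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
edges = None
for left, right in pairs:
    part = sdf.select(f.col(left).alias('a'), f.col(right).alias('b'))
    edges = part if edges is None else edges.union(part)

# deduplicate, then keep only the orientation with the lesser index first
edge_df = edges.distinct().filter(f.col('a') < f.col('b'))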
Is there a more effective way?
Upvotes: 0
Views: 165
Reputation: 13541
I think explode is a good fit in this case. The orderBy is not great for performance, but I added it to reproduce the desired output:
from pyspark.sql import functions as f

# build one sorted [min, max] array per triangle edge; explode turns them into rows
df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')),
                            f.array_sort(f.array('B', 'C')),
                            f.array_sort(f.array('C', 'A')))).alias('temp')) \
  .select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')) \
  .distinct().orderBy('a', 'b') \
  .show(truncate=False)
+---+---+
|a |b |
+---+---+
|1 |3 |
|1 |5 |
|1 |9 |
|2 |8 |
|2 |10 |
|3 |4 |
|3 |5 |
|3 |9 |
|4 |5 |
|7 |8 |
|7 |9 |
|8 |9 |
|8 |10 |
|9 |10 |
+---+---+
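Note that df here must be a Spark DataFrame; if you start from the pandas frame in the question, convert it first. A minimal sketch, assuming a local SparkSession:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]], columns=list('ABC'))
df = spark.createDataFrame(pdf)  # Spark DataFrame used by the snippet above
The array_sort on each pair is what lets distinct() collapse [9, 8] and [8, 9] into a single row, so no extra filtering step is needed.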
Upvotes: 1