Drizzle

Reputation: 105

Column transforms in a PySpark dataframe

I'm new to PySpark.

I want to do some column transformations.

My dataframe:

import pandas as pd
df = pd.DataFrame([[10,  8,  9], [ 3,  5,  4], [ 1,  3,  9], [ 1,  5,  3], [ 2,  8, 10], [ 8,  7,  9]],columns=list('ABC'))

df:

    A   B   C
0   10  8   9
1   3   5   4
2   1   3   9
3   1   5   3
4   2   8   10
5   8   7   9

In df, each row is a triangle, and columns 'A', 'B', 'C' hold its vertex indices.

I want to get a dataframe of all the triangle edges.

Under these conditions:

  1. For each edge, the lesser vertex index always comes first.
  2. Duplicate edges are removed.
  3. Edge [8, 9] and edge [9, 8] count as the same edge; only [8, 9] remains (lesser vertex index first).
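
Taken together, these conditions amount to sorting each edge's two endpoints and deduplicating. In plain Python, on the data above:

# Sort each edge's endpoints (lesser vertex first) and deduplicate with a set.
rows = [[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]]
edges = {tuple(sorted(e)) for a, b, c in rows for e in ((a, b), (b, c), (c, a))}
print(sorted(edges))  # 14 unique edges, from (1, 3) to (9, 10)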

My desired dataframe edge_df:

1   3
1   5
1   9
2   8
2   10
3   4
3   5
3   9
4   5
7   8
7   9
8   9
8   10
9   10

I tried unioning the column pairs 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', calling distinct(), and then dropping the rows where the lesser vertex index ends up in the right column.
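
For reference, a minimal sketch of that union-based approach (assuming df is already a Spark DataFrame with columns A, B, C):

from functools import reduce
from pyspark.sql import functions as f

# Union all six ordered column pairs, keep only rows with the lesser
# vertex on the left, then deduplicate.
pairs = [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
edges = reduce(lambda u, p: u.union(p),
               [df.select(f.col(x).alias('a'), f.col(y).alias('b')) for x, y in pairs])
edge_df = edges.filter(f.col('a') < f.col('b')).distinct().orderBy('a', 'b')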

Is there a more efficient way?

Upvotes: 0

Views: 165

Answers (1)

Lamanus

Reputation: 13541

I think explode works well in this case. The orderBy is not great for performance, but I added it to match the desired output.

from pyspark.sql import functions as f

# Build the three sorted edges of each triangle, explode to one edge per
# row, then deduplicate and order for display.
df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')),
                            f.array_sort(f.array('B', 'C')),
                            f.array_sort(f.array('C', 'A')))).alias('temp')) \
  .select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')) \
  .distinct().orderBy('a', 'b') \
  .show(truncate=False)

+---+---+
|a  |b  |
+---+---+
|1  |3  |
|1  |5  |
|1  |9  |
|2  |8  |
|2  |10 |
|3  |4  |
|3  |5  |
|3  |9  |
|4  |5  |
|7  |8  |
|7  |9  |
|8  |9  |
|8  |10 |
|9  |10 |
+---+---+
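
Note that df here must be a Spark DataFrame. If you start from the pandas frame in the question, convert it first; a minimal sketch, assuming an active SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(df)  # df was the pandas DataFrame from the question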

Upvotes: 1
