patpat

Reputation: 207

Create a column that groups associated values

In short, I want to put values that are associated with each other into the same group.

Here is what I have:

col1    col2
1        2
1        3
2        3
4        5
5        6

And I want this:

col1    col2    group
1        2        1
1        3        1
2        3        1
4        5        2
5        6        2

To produce those two groups, here are the steps if I do it manually.

Do you have an idea of how to solve this in SQL? I am using Hive or PySpark.

Upvotes: 0

Views: 78

Answers (1)

patpat

Reputation: 207

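Based on A.R.Ferguson's answer, I was able to work out a solution using PySpark and GraphFrames. Each value becomes a vertex, each (col1, col2) row becomes an edge, and the connected components of that graph are the groups: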

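from graphframes import GraphFrame

# Assumes the pyspark shell, where `sc` and `sqlContext` already exist.
# Recent GraphFrames versions need a checkpoint directory for
# connectedComponents(); the path below is only an example.
sc.setCheckpointDir("/tmp/graphframes-checkpoints")

# Vertices: one row per distinct value; GraphFrames requires an "id" column.
vertices = sqlContext.createDataFrame([
  ("A",  1),
  ("B",  2),
  ("C",  3),
  ("D",  4),
  ("E",  5),
  ("F",  6)], ["name",  "id"])
# Edges: one row per (col1, col2) pair; the columns must be "src" and "dst".
edges = sqlContext.createDataFrame([
  (1, 2),
  (1, 3),
  (2, 3),
  (4, 5),
  (5, 6)], ["src", "dst"])
g = GraphFrame(vertices, edges)
# Adds a "component" column; vertices sharing a value belong to the same group.
result = g.connectedComponents()
result.show()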
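
For anyone who wants to check the grouping logic on a small sample without Spark, here is a minimal pure-Python sketch of the same connected-components idea using union-find. It is not part of the GraphFrames solution, and the function name `assign_groups` is made up for this example. It assumes the pairs fit in memory and numbers the groups in order of first appearance.

```python
def assign_groups(pairs):
    """Return (col1, col2, group) rows; linked values share a group number."""
    parent = {}

    def find(x):
        # Follow parent links to the representative of x's set.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    def union(a, b):
        # Merge the sets containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Number the groups 1, 2, ... in order of first appearance.
    group_ids = {}
    rows = []
    for a, b in pairs:
        root = find(a)
        group = group_ids.setdefault(root, len(group_ids) + 1)
        rows.append((a, b, group))
    return rows


pairs = [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6)]
rows = assign_groups(pairs)
for row in rows:
    print(row)
```

On the sample data from the question this puts the first three rows in group 1 and the last two in group 2.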

Thanks again, Ferguson.

Upvotes: 1
