Tim4812

Reputation: 11

spark sql connect by prior

It seems that Spark still doesn't support CONNECT BY PRIOR. Please let me know if there is any workaround for it.

Current Input

ColA , ColB 
D       E
A       B 
C       D
B       C

Required output -

ColA , ColB 
A       B
B       C 
C       D
D       E

If there is any solution through Spark SQL, please let me know.

Upvotes: 1

Views: 4254

Answers (2)

Tim C.

Reputation: 179

See pyspark-connectby on PyPI: https://pypi.org/project/pyspark-connectby/

With this library you can now do a CONNECT BY on a Spark DataFrame, e.g.:

from pyspark_connectby import connectBy
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The schema declares string columns, so the ids are given as strings.
schema = 'emp_id string, manager_id string, name string'
data = [['1', None, 'Carlos'],
        ['11', '1', 'John'],
        ['111', '11', 'Jorge'],
        ['112', '11', 'Kwaku'],
        ['113', '11', 'Liu']
        ]
df = spark.createDataFrame(data, schema)

# Walk the hierarchy down from emp_id '1'.
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()

Output is:

+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+------+----------+-----+-----------------+----------+------+
|     1|         1|    1|            false|      null|Carlos|
|    11|         1|    2|            false|         1|  John|
|   111|         1|    3|             true|        11| Jorge|
|   112|         1|    3|             true|        11| Kwaku|
|   113|         1|    3|             true|        11|   Liu|
+------+----------+-----+-----------------+----------+------+

Upvotes: 0

Ged

Reputation: 18108

There is, but it is painful. It's too long to type out, but here is someone who did it.

http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/
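
Applied to the question's data, the core of that technique is an iterative self-join: find the root row, then repeatedly join one level deeper until no new rows turn up. A rough PySpark sketch of the idea (this is my illustration of the linked approach, not the post's exact code, and it assumes the data forms an acyclic chain):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

edges = spark.createDataFrame(
    [("D", "E"), ("A", "B"), ("C", "D"), ("B", "C")],
    ["ColA", "ColB"])

# Root edge: the ColA value that never appears in ColB.
frontier = (edges.join(edges.select(F.col("ColB").alias("ColA")),
                       on="ColA", how="left_anti")
                 .withColumn("level", F.lit(1)))

result = frontier
level = 1
while True:
    level += 1
    # Step one level down: the previous ColB becomes the next ColA.
    frontier = (frontier.select(F.col("ColB").alias("ColA"))
                        .join(edges, on="ColA")
                        .withColumn("level", F.lit(level)))
    if frontier.rdd.isEmpty():
        break
    result = result.unionByName(frontier)

result.orderBy("level").select("ColA", "ColB").show()

Ordering by the accumulated level yields the rows in the required A -> B, B -> C, C -> D, D -> E sequence.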

My advice: this is not typical Spark processing, so do it in Oracle or DB2 and Sqoop the results in, or read them via a DataFrame JDBC read.
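
A minimal sketch of the JDBC route, pushing the CONNECT BY down to Oracle so Spark only reads the finished result. The connection details, table name, and the root value 'A' are placeholders; the query option needs Spark 2.4+ and the Oracle JDBC driver on the classpath:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # placeholder connection
      .option("query", "SELECT ColA, ColB, LEVEL AS lvl FROM edges "
                       "START WITH ColA = 'A' "
                       "CONNECT BY PRIOR ColB = ColA")
      .option("user", "scott")          # placeholder credentials
      .option("password", "tiger")
      .load())

# Oracle computes the hierarchy; Spark just sorts and displays it.
df.orderBy("lvl").select("ColA", "ColB").show()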

It can also be done via Pregel: https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/

Upvotes: 3
