Reputation: 11
It seems like Spark still doesn't support CONNECT BY PRIOR. Please let me know if there is any workaround for that.
Current input:

ColA  ColB
D     E
A     B
C     D
B     C

Required output:

ColA  ColB
A     B
B     C
C     D
D     E
If there is any solution through Spark SQL, please let me know.
Upvotes: 1
Views: 4254
Reputation: 179
See spark-connectby on PyPI: https://pypi.org/project/pyspark-connectby/
With that library you can now do a connect-by on a Spark DataFrame, e.g.:
from pyspark.sql import SparkSession
from pyspark_connectby import connectBy  # importing also exposes df.connectBy(...)

spark = SparkSession.builder.getOrCreate()

# manager_id points at the parent row's emp_id; values are strings to match the schema
schema = 'emp_id string, manager_id string, name string'
data = [['1', None, 'Carlos'],
        ['11', '1', 'John'],
        ['111', '11', 'Jorge'],
        ['112', '11', 'Kwaku'],
        ['113', '11', 'Liu']]
df = spark.createDataFrame(data, schema)

# Traverse the hierarchy top-down, starting from emp_id = '1'
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()
Output is:
+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id| name|
+------+----------+-----+-----------------+----------+------+
| 1| 1| 1| false| null|Carlos|
| 11| 1| 2| false| 1| John|
| 111| 1| 3| true| 11| Jorge|
| 112| 1| 3| true| 11| Kwaku|
| 113| 1| 3| true| 11| Liu|
+------+----------+-----+-----------------+----------+------+
Upvotes: 0
Reputation: 18108
There is, but it is painful. It is too long to type out here, but here is someone who has done it:
http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/
My advice: this is not typical Spark processing, so do it in Oracle or DB2 and Sqoop the results in, or read them via a JDBC DataFrame read.
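If you do want to stay in Spark, the recursion boils down to an iterative self-join: keep joining the current frontier back to the edge list until no new rows turn up. A minimal sketch against the question's edge list (the loop and column handling are mine, not code from the linked article):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

edges = spark.createDataFrame(
    [('D', 'E'), ('A', 'B'), ('C', 'D'), ('B', 'C')],
    'ColA string, ColB string')

# Level 1: rows whose ColA never appears as a ColB (the roots).
roots = edges.join(edges.select(F.col('ColB').alias('ColA')), on='ColA', how='left_anti')
result = roots.withColumn('level', F.lit(1))

# Join one level deeper per pass until the frontier is empty.
frontier = result
while True:
    nxt = (frontier.alias('p')
           .join(edges.alias('c'), F.col('p.ColB') == F.col('c.ColA'))
           .select(F.col('c.ColA'), F.col('c.ColB'),
                   (F.col('p.level') + 1).alias('level')))
    if not nxt.head(1):  # no rows at the next level -> done
        break
    result = result.unionByName(nxt)
    frontier = nxt

result.orderBy('level').select('ColA', 'ColB').show()

Each pass launches a Spark job, so for deep hierarchies you would cache or checkpoint the frontier, and a cycle in the data would loop forever without an extra guard.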
A Pregel-based approach works as well: https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/
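If you want Pregel without leaving Python (the blog post uses Scala GraphX), GraphFrames exposes a Pregel API that can compute each node's depth. An untested sketch over the question's chain, assuming the graphframes package is installed and that 'A' is the root:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from graphframes import GraphFrame
from graphframes.lib import Pregel

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('/tmp/spark-checkpoints')  # Pregel checkpoints intermediate state

vertices = spark.createDataFrame([('A',), ('B',), ('C',), ('D',), ('E',)], ['id'])
edges = spark.createDataFrame([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')],
                              ['src', 'dst'])
g = GraphFrame(vertices, edges)

# Root 'A' starts at level 1; every other node learns its parent's level + 1.
levels = (g.pregel
          .setMaxIter(5)  # must be at least the depth of the hierarchy
          .withVertexColumn('level',
                            F.when(F.col('id') == 'A', F.lit(1)),
                            F.coalesce(F.col('level'), Pregel.msg()))
          .sendMsgToDst(Pregel.src('level') + 1)  # a null level sends no message
          .aggMsgs(F.min(Pregel.msg()))
          .run())

levels.show()  # join back to the edge list and sort by level for the required order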
Upvotes: 3