Reputation: 1117
I have a PySpark pipeline RDD like bellow
(1,([1,2,3,4],[5,3,4,5])
(2,([1,2,4,5],[4,5,6,7])
I want to generate Data Frame like below:
Id sid cid
1 1 5
1 2 3
1 3 4
1 4 5
2 1 4
2 2 5
2 4 6
2 5 7
Please help me on this.
Upvotes: 1
Views: 2132
Reputation: 35249
If you have an RDD like this one,
rdd = sc.parallelize([
(1, ([1,2,3,4], [5,3,4,5])),
(2, ([1,2,4,5], [4,5,6,7]))
])
I would just use RDDs:
rdd.flatMap(lambda rec:
((rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1]))
).toDF(["id", "sid", "cid"]).show()
# +---+---+---+
# | id|sid|cid|
# +---+---+---+
# | 1| 1| 5|
# | 1| 2| 3|
# | 1| 3| 4|
# | 1| 4| 5|
# | 2| 1| 4|
# | 2| 2| 5|
# | 2| 4| 6|
# | 2| 5| 7|
# +---+---+---+
Upvotes: 2