eemilk

Reputation: 1638

PySpark making dataframe with three columns from RDD with tuple and int

I have an RDD of the form:

[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]

What I have done is

df = spark.createDataFrame(rdd, ["src", "rp"])

which gives me one column holding the tuple and one holding the int:

+-------+-----+
|    src|rp   |
+-------+-----+
|[1, 10]|    1|
|[10, 1]|    1|
|[1, 12]|    1|
|[12, 1]|    1|
+-------+-----+

But I can't figure out how to make a src column from the first element of [x, y] and a dst column from the second, so that I end up with a dataframe with three columns src, dst and rp:

+-------+-----+-----+
|    src|dst  |rp   |
+-------+-----+-----+
|      1|   10|    1|
|     10|    1|    1|
|      1|   12|    1|
|     12|    1|    1|
+-------+-----+-----+

Upvotes: 1

Views: 233

Answers (2)

user238607

Reputation: 2478

You can just do a simple select on the dataframe to separate out the columns: createDataFrame infers the tuple as a struct, so its fields are addressable as src._1 and src._2. No need for an intermediate transformation as the other answer suggests.

from pyspark.sql.functions import col

df = spark.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"), col("rp"))
df.show()

Here's the result:

+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+

Upvotes: 1

ernest_k

Reputation: 45339

You need an intermediate transformation on your RDD to make it a flat list of three elements:

spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"]).show()

+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+
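Either way, the flattening step itself is plain Python, so the mapper passed to rdd.map can be sanity-checked on the raw list without a Spark session (a quick sketch, not part of the original answer):

```python
# The sample data from the question: ((src, dst), rp) pairs.
rdd_data = [(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]

# Same lambda as in rdd.map above: unpack the nested tuple into a flat row.
flatten = lambda l: [l[0][0], l[0][1], l[1]]

rows = [flatten(l) for l in rdd_data]
# rows → [['1', '10', 1], ['10', '1', 1], ['1', '12', 1], ['12', '1', 1]]
```

Each flat list then maps positionally onto the ["src", "dst", "rp"] schema.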

Upvotes: 2
