eemilk

Reputation: 1638

PySpark making dataframe with three columns from RDD with tuple and int

I have an RDD of the form:

[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]

What I have done is

df = spark.createDataFrame(rdd, ["src", "rp"])

which gives me one column holding the tuple and one holding the int:

+-------+-----+
|    src|rp   |
+-------+-----+
|[1, 10]|    1|
|[10, 1]|    1|
|[1, 12]|    1|
|[12, 1]|    1|
+-------+-----+

But I can't figure out how to make a src column from the first element of [x, y] and a dst column from the second, so that I end up with a dataframe with three columns src, dst and rp:

+-------+-----+-----+
|    src|dst  |rp   |
+-------+-----+-----+
|      1|   10|    1|
|     10|    1|    1|
|      1|   12|    1|
|     12|    1|    1|
+-------+-----+-----+

Upvotes: 1

Views: 233

Answers (2)

user238607

Reputation: 2478

You can just do a simple select on the dataframe to separate out the columns: createDataFrame infers the tuple as a struct, so its fields are addressable as src._1 and src._2. No need for an intermediate transformation as the other answer suggests.

from pyspark.sql.functions import col

df = spark.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"), col("rp"))
df.show()

Here's the result:

+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+

Upvotes: 1

ernest_k

Reputation: 45339

You need an intermediate transformation on your RDD to make it a flat list of three elements:

spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"]).show()

+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+
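Either way, the flattening step itself is plain Python, so the mapper passed to rdd.map can be sanity-checked on the raw list without a Spark session (a quick sketch, not part of the original answer):

```python
# The sample data from the question: ((src, dst), rp) pairs.
rdd_data = [(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]

# Same lambda as in rdd.map above: unpack the nested tuple into a flat row.
flatten = lambda l: [l[0][0], l[0][1], l[1]]

rows = [flatten(l) for l in rdd_data]
# rows → [['1', '10', 1], ['10', '1', 1], ['1', '12', 1], ['12', '1', 1]]
```

Each flat list then maps positionally onto the ["src", "dst", "rp"] schema.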

Upvotes: 2
