Reputation: 1638
I have an RDD of the form:
[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
What I have done is
df = spark.createDataFrame(rdd, ["src", "rp"])
which gives me one column for the tuple and one for the int, like this:
+-------+---+
|    src| rp|
+-------+---+
|[1, 10]|  1|
|[10, 1]|  1|
|[1, 12]|  1|
|[12, 1]|  1|
+-------+---+
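(For reference, a minimal way to reproduce this setup, assuming a SparkSession named spark:)
# Each ((src, dst), rp) tuple is inferred as a struct column plus an int column.
rdd = spark.sparkContext.parallelize(
    [(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
)
df = spark.createDataFrame(rdd, ["src", "rp"])
df.show()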
But I can't figure out how to turn the first element of [x, y] into a src column and the second element into a dst column, so that I end up with a DataFrame with three columns src, dst and rp:
+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+
Upvotes: 1
Views: 233
Reputation: 2478
You can just do a simple select on the DataFrame to separate out the columns. There is no need for the intermediate transformation that the other answer suggests.
from pyspark.sql.functions import col
df = spark.createDataFrame(rdd, ["src", "rp"])
# The tuple becomes a struct column with fields _1 and _2; select and rename them.
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"), col("rp"))
df.show()
Here's the result:
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
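As a variation (an untested sketch, assuming the same df), you could also expand every field of the struct at once and rename the resulting columns afterwards:
# Expand all struct fields, then rename the columns in order.
df2 = df.select("src.*", "rp").toDF("src", "dst", "rp")
df2.show()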
Upvotes: 1
Reputation: 45339
You need an intermediate transformation on your RDD to make it a flat list of three elements:
spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"])
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
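If you prefer explicit unpacking over indexing, a named helper does the same thing (a sketch, assuming the rdd and spark session from the question):
def to_row(pair):
    (src, dst), rp = pair  # unpack the ((src, dst), rp) structure
    return src, dst, rp

spark.createDataFrame(rdd.map(to_row), ["src", "dst", "rp"]).show()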
Upvotes: 2