Reputation: 73
I have a dataframe that looks like this:
preds.take(1)
[Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))]
I want the whole thing to be one row, without the nested row in there. So, the first value would get a name and be a part of the one row object. If I wanted to name it "ID", it would look like this:
preds.take(1)
[Row(ID=0, val1=False, val2=1, val3='high_school')]
I've tried various things within a map, but nothing produces what I'm looking for (or I get errors). I've tried:
preds.map(lambda point: (point._1, point._2))
preds.map(lambda point: point._2.append(point._1))
preds.map(lambda point: point._2['ID']=point._1)
preds.map(lambda point: (point._2).ID=point._1)
Upvotes: 2
Views: 3687
Reputation: 330073
Since Row is a tuple and tuples are immutable, you can only create a new object. Using plain tuples:
from pyspark.sql import Row
r = Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))
r[:1] + r[1]
## (0, False, 1, 'high_school')
or preserving __fields__:
Row(*r.__fields__[:1] + r[1].__fields__)(*r[:1] + r[1])
## Row(_1=0, val1=False, val2=1, val3='high_school')
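Since Row behaves like an immutable tuple with named fields, the same flatten-and-rename logic can be sketched without a Spark session using collections.namedtuple as a stand-in (the Inner and Flat types here are illustrative, not part of pyspark; the outer pair is a plain tuple because namedtuple field names may not start with an underscore):

```python
from collections import namedtuple

# Spark-free stand-in for the nested Row: namedtuple is also an
# immutable tuple with named fields.
Inner = namedtuple("Inner", ["val1", "val2", "val3"])
r = (0, Inner(val1=False, val2=1, val3="high_school"))

# Build a new type whose fields are "ID" plus the inner field names,
# then instantiate it from the concatenated values (mirroring r[:1] + r[1]).
Flat = namedtuple("Flat", ["ID"] + list(r[1]._fields))
flat = Flat(*(r[:1] + tuple(r[1])))
# flat == Flat(ID=0, val1=False, val2=1, val3='high_school')
```

Because the original Row is immutable, every variant of this approach builds a brand-new object rather than mutating in place, which is why the in-place attempts in the question fail.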
In practice, operating directly on rows should be avoided in favor of the DataFrame DSL, without fetching data to the Python interpreter. Using alias to get the "ID" name you asked for:
df = sc.parallelize([r]).toDF()
df.select(df["_1"].alias("ID"), "_2.val1", "_2.val2", "_2.val3")
Upvotes: 2