John

Reputation: 1167

Iterating List of SQL.Row with PySpark

I have a pyspark.sql.Row that looks something like this:

from pyspark.sql import Row

my_row = Row(id=1,
    value=[Row(id=1, value="value1"), Row(id=2, value="value2")])

I'd like to get the value from each of the nested rows using something like:

[x.value for x in my_row.value]

The problem is that when I iterate, the entire row is converted into tuples,

my_row = (1, [(1, "value1"), (2, "value2")])

and I lose the schema. Is there a way to iterate and retain the schema for the list of rows?

Upvotes: 1

Views: 3634

Answers (1)

zero323

Reputation: 330413

To be precise, pyspark.sql.Row is actually a tuple:

isinstance(my_row, tuple)
# True
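
To see what that means in practice, here is a minimal sketch (reusing the my_row from the question) showing that both positional and named access work, exactly as you'd expect from a tuple subclass:

from pyspark.sql import Row

my_row = Row(id=1,
    value=[Row(id=1, value="value1"), Row(id=2, value="value2")])

my_row[0]    # positional access, because Row subclasses tuple
# 1
my_row.id    # named access, because Row also keeps the field names
# 1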

Since Python tuples are immutable, the only option I see is to rebuild the Row from scratch:

d = my_row.asDict()
d["value"] = [Row(value=x.value) for x in  my_row.value]
Row(**d)

## Row(id=1, value=[Row(value='value1'), Row(value='value2')])
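
As a quick check (a sketch continuing from the snippet above), the rebuilt row supports the list comprehension from the question and keeps the field names:

new_row = Row(**d)   # new_row is just an illustrative name

[x.value for x in new_row.value]
# ['value1', 'value2']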

Upvotes: 2
