rye

Reputation: 497

How do I order fields of my Row objects in Spark (Python)

I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following they are ordered alphabetically.

row = Row(foo=1, bar=2)

Then it creates an object like the following:

Row(bar=2, foo=1)

When I then create a dataframe on this object, the column order is going to be bar first, foo second, when I'd prefer to have it the other way around.

I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?

Upvotes: 14

Views: 12948

Answers (3)

zero323

Reputation: 330063

Spark >= 3.0

Field sorting was removed in SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), except in legacy mode, which is enabled by setting the following environment variable:

PYSPARK_ROW_FIELD_SORTING_ENABLED=true 
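For example, to opt in to the legacy (sorted) behavior on Spark 3.x, you'd export the variable before launching your application (a sketch; `my_app.py` is a placeholder for your script):

```shell
# Restore the pre-3.0 alphabetical Row field sorting
export PYSPARK_ROW_FIELD_SORTING_ENABLED=true
spark-submit my_app.py
```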

Spark < 3.0

But is there any way to prevent the Row object from ordering them?

There isn't. If you provide kwargs, the arguments are sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.
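You can see the effect of this with plain Python, no Spark session needed. This sketch mimics what the keyword-based `Row` constructor effectively did before 3.0: collect the kwargs, then sort the field names alphabetically.

```python
# Keyword arguments as Row(foo=1, bar=2) would receive them
kwargs = dict(foo=1, bar=2)

# Sorting the names alphabetically reorders the fields
ordered_names = sorted(kwargs)
ordered_values = tuple(kwargs[k] for k in ordered_names)

print(ordered_names)   # ['bar', 'foo']
print(ordered_values)  # (2, 1)
```

This is why `Row(foo=1, bar=2)` comes back as `Row(bar=2, foo=1)`.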

Just use plain tuples:

rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

rdd.toDF(["foo", "bar"])

or createDataFrame:

from pyspark.sql.types import *

spark.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])

spark.createDataFrame(rdd, schema)

You can also use namedtuples:

from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
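The reason this works is that namedtuples keep fields in declaration order, regardless of the order the keywords are passed in, which you can check without Spark:

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])

# Keywords passed in "wrong" order; fields still follow the declaration
row = FooBar(bar=2, foo=1)

print(row._fields)  # ('foo', 'bar')
print(tuple(row))   # (1, 2)
```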

Finally, you can reorder the columns with select:

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")

Upvotes: 15

bloodrootfc

Reputation: 1223

To sort your original schema so that it matches the alphabetical column order produced by Row:

from pyspark.sql.types import StructType

schema_sorted = StructType()
structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
for item in structfield_list_sorted:
    schema_sorted.add(item)
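The sort-by-name pattern itself can be checked without a Spark session, using a lightweight stand-in for StructField (the `Field` namedtuple here is purely illustrative, not a Spark type):

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.types.StructField
Field = namedtuple("Field", ["name", "dataType"])

original = [Field("foo", "int"), Field("bar", "int")]
fields_sorted = sorted(original, key=lambda x: x.name)

print([f.name for f in fields_sorted])  # ['bar', 'foo']
```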

Upvotes: 3

Patrick Z

Reputation: 2339

From documentation:

Row also can be used to create another Row like class, then it could be used to create Row objects

In this case the order of columns is preserved:

>>> FooRow = Row('foo', 'bar')
>>> row = FooRow(1, 2)
>>> spark.createDataFrame([row]).dtypes
[('foo', 'bigint'), ('bar', 'bigint')]

Upvotes: 3
