Reputation: 497
I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following they are ordered alphabetically.
row = Row(foo=1, bar=2)
Then it creates an object like the following:
Row(bar=2, foo=1)
When I then create a dataframe on this object, the column order is going to be bar first, foo second, when I'd prefer to have it the other way around.
I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?
Upvotes: 14
Views: 12948
Reputation: 330063
Spark >= 3.0
Field sorting was removed by SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), except in legacy mode, which is enabled by setting the following environment variable:
PYSPARK_ROW_FIELD_SORTING_ENABLED=true
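As a minimal sketch, the variable can also be set from Python on the driver. This assumes it is set before PySpark is imported, and it does nothing for the executors, which would need the same value through whatever mechanism your deployment uses for executor environment variables.

```python
import os

# Assumption: set the flag before importing pyspark, so the legacy
# (alphabetical) field ordering is already in effect when Row is first used.
os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"] = "true"

print(os.environ["PYSPARK_ROW_FIELD_SORTING_ENABLED"])
```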
Spark < 3.0
But is there any way to prevent the Row object from ordering them?
There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.
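To illustrate the sorting described here, the following pure-Python sketch mimics the behavior. It is not PySpark's actual code, and `legacy_row_fields` is a hypothetical helper name used only for this example.

```python
def legacy_row_fields(**kwargs):
    """Hypothetical helper: order keyword fields the way Row did before Spark 3.0."""
    names = sorted(kwargs)  # alphabetical, regardless of the order at the call site
    values = tuple(kwargs[name] for name in names)
    return names, values


names, values = legacy_row_fields(foo=1, bar=2)
print(names, values)  # ['bar', 'foo'] (2, 1)
```

Because the names are sorted, calling it with `bar=2, foo=1` gives the same result, which is the deterministic behavior the sorting was meant to guarantee.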
Just use plain tuples:
rdd = sc.parallelize([(1, 2)])
and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):
rdd.toDF(["foo", "bar"])
or createDataFrame:
from pyspark.sql.types import IntegerType, StructField, StructType
spark.createDataFrame(rdd, ["foo", "bar"])
# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])
spark.createDataFrame(rdd, schema)
You can also use namedtuples:
from collections import namedtuple
FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
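The namedtuple approach works because the field order is fixed by the class definition, not by the order of keyword arguments at the call site. A small Spark-free sketch of that property (that Spark reads the column names from the namedtuple's fields is my understanding, not something shown here):

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])

# Keyword order at the call site does not affect the stored field order.
record = FooBar(bar=2, foo=1)

print(FooBar._fields)  # ('foo', 'bar')
print(tuple(record))   # (1, 2)
```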
Finally, you can reorder the columns with select:
sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
Upvotes: 15
Reputation: 1223
How to sort your original schema to match the alphabetical order of the RDD:
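from pyspark.sql.types import StructType
# Rebuild the schema with its fields sorted alphabetically by name
schema_sorted = StructType()
structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
for item in structfield_list_sorted:
    schema_sorted.add(item)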
Upvotes: 3
Reputation: 2339
From the documentation:
Row also can be used to create another Row like class, then it could be used to create Row objects
In this case the column order is preserved:
>>> FooRow = Row('foo', 'bar')
>>> row = FooRow(1, 2)
>>> spark.createDataFrame([row]).dtypes
[('foo', 'bigint'), ('bar', 'bigint')]
Upvotes: 3