Reputation: 3182
In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([
    StructField("temperature", DoubleType(), True),
    StructField("temperature_unit", StringType(), True),
    StructField("humidity", DoubleType(), True),
    StructField("humidity_unit", StringType(), True),
    StructField("pressure", DoubleType(), True),
    StructField("pressure_unit", StringType(), True)
])
For some data sources it is possible to infer the schema from the data source and get a dataframe with this schema definition.
Is it possible to get the schema definition (in the form described above) from a dataframe whose schema has been inferred?
df.printSchema()
prints the schema as a tree, but I need to reuse the schema, having it defined as above, so I can read a data source with a schema that has been inferred before from another data source.
Upvotes: 46
Views: 136208
Reputation: 24356
A DDL-formatted schema can be obtained using
df._jdf.schema().toDDL()
or, in a similar compact (non-DDL) notation,
df.schema.simpleString()
# The outer `struct<...>` wrapper in simpleString's output is notation for the row type; it is not part of the dataframe's schema.
Examples:
df = spark.createDataFrame([(1, [('name1', 'id1'), ('name2', 'id2')])])
df._jdf.schema().toDDL()
# _1 BIGINT,_2 ARRAY<STRUCT<_1: STRING, _2: STRING>>
df.schema.simpleString()
# struct<_1:bigint,_2:array<struct<_1:string,_2:string>>>
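If the goal is to reuse that schema when reading another source, the DDL string can be passed straight to the reader. A minimal sketch, assuming Spark 2.3+ (where DataFrameReader.schema accepts a DDL string); 'other_data.csv' is a placeholder path:
# Reuse the inferred schema (as DDL) to read another data source.
ddl = df._jdf.schema().toDDL()
df2 = spark.read.schema(ddl).csv('other_data.csv', header=True)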
Upvotes: 1
Reputation: 188
Since version 3.3.0, PySpark returns df.schema in Python constructor syntax: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html#pyspark.sql.DataFrame.schema
>>> df.schema
StructType([StructField('age', IntegerType(), True),
            StructField('name', StringType(), True)])
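Since that repr is valid constructor syntax, it can be pasted back into code to rebuild the schema and read another source with it. A minimal sketch; 'people.json' is a placeholder path:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Paste the repr of df.schema back in verbatim (valid Python since PySpark 3.3.0).
schema = StructType([StructField('age', IntegerType(), True),
                     StructField('name', StringType(), True)])

# Read a new data source with the rebuilt schema.
df2 = spark.read.schema(schema).json('people.json')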
Upvotes: 0
Reputation: 491
If you are looking for a DDL string from PySpark:
from pyspark.sql import DataFrame

# 'LOCATION' is a placeholder for the path to your data source.
df: DataFrame = spark.read.load('LOCATION')
# Serialize the inferred schema to JSON, then convert it to a DDL string via the JVM.
schema_json = df.schema.json()
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
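The resulting DDL string can then be handed back to the reader when loading another source. A minimal sketch, assuming Spark 2.3+ (where schema() accepts a DDL string); 'NEW_LOCATION' is a placeholder like 'LOCATION' above:
# Apply the extracted DDL schema when reading a new data source.
df2 = spark.read.schema(ddl).load('NEW_LOCATION')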
Upvotes: 12
Reputation: 994
The code below will give you a well-formatted, tabular schema definition of a known dataframe. This is quite useful when you have a very large number of columns and editing by hand is cumbersome. You can then apply it to your new dataframe and hand-edit any columns as needed.
from pyspark.sql.types import StructType

# Collect the dataframe's StructFields into an editable list.
schema = [i for i in df.schema]
And then from here, you have your new schema:
NewSchema = StructType(schema)
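For completeness, a sketch of inspecting the list field by field and applying the (possibly hand-edited) schema to a new source; 'new_data.csv' is a placeholder path:
# One StructField per line, convenient for hand-editing.
for field in schema:
    print(field)

# Read a new data source with the rebuilt schema.
new_df = spark.read.schema(NewSchema).csv('new_data.csv', header=True)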
Upvotes: 13
Reputation: 1353
You can re-use the schema of an existing DataFrame:
l = [('Ankita',25,'F'),('Jalfaizy',22,'M'),('saurabh',20,'M'),('Bala',26,None)]
people_rdd = spark.sparkContext.parallelize(l)
schemaPeople = people_rdd.toDF(['name','age','gender'])
schemaPeople.show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
spark.createDataFrame(people_rdd, schemaPeople.schema).show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
Just use df.schema to get the underlying schema of the dataframe:
schemaPeople.schema
StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))
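The same schema object also works with the reader when loading another source. A minimal sketch; 'more_people.json' is a placeholder path:
# Read another data source with the schema taken from schemaPeople.
more_people = spark.read.schema(schemaPeople.schema).json('more_people.json')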
Upvotes: 9
Reputation:
Yes, it is possible. Use the DataFrame.schema property:
schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
New in version 1.3.
The schema can also be exported to JSON and imported back if needed.
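A minimal round-trip sketch of that JSON export/import:
import json
from pyspark.sql.types import StructType

# Export the schema to a JSON string...
schema_json = df.schema.json()

# ...and rebuild an identical StructType from it later.
restored = StructType.fromJson(json.loads(schema_json))
assert restored == df.schema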
Upvotes: 58