Reputation: 4159
I have a folder of parquet files that I am reading into a PySpark session. How can I inspect / parse the individual schema field types and other info (e.g. for the purpose of comparing schemas between dataframes to see exact type differences)?
I can see the parquet schema and specific field names with something like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
sparkSession = SparkSession.builder.appName("data_debugging").getOrCreate()
df = sparkSession.read.option("header", "true").parquet("hdfs://hw.co.local:8020/path/to/parquets")
df.schema # or df.printSchema()
df.schema.fieldNames()
So I can see the schema
StructType(List(StructField(SOME_FIELD_001,StringType,true),StructField(SOME_FIELD_002,StringType,true),StructField(SOME_FIELD_003,StringType,true)))
but I'm not sure how to get the values for specific fields, e.g. something like...
df.schema.getType("SOME_FIELD_001")
or
df.schema.getData("SOME_FIELD_001") #type: dict
Does anyone know how to do something like this?
Upvotes: 2
Views: 2212
Reputation: 663
You can use the df.dtypes attribute to get each field name along with its datatype, and the result can be converted to a dict object as shown below,
myschema = dict(df.dtypes)
Now, you can obtain the datatypes as shown below,
myschema.get('some_field_002')
Output:
'string'
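Since the stated goal is comparing schemas between dataframes to see exact type differences, the same dict can be reused for that. A minimal sketch, assuming a hypothetical second dataframe df2 read from another (made-up) path:
# Hypothetical second dataframe; any other DataFrame works the same way
df2 = sparkSession.read.parquet("hdfs://hw.co.local:8020/path/to/other_parquets")

schema1 = dict(df.dtypes)
schema2 = dict(df2.dtypes)

# Columns present in both dataframes whose simple type strings differ
diffs = {col: (schema1[col], schema2[col])
         for col in schema1.keys() & schema2.keys()
         if schema1[col] != schema2[col]}
print(diffs)  # e.g. {'some_field_003': ('string', 'bigint')}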
Alternatively, if you want the datatypes as pyspark.sql.types objects, you can use the df.schema attribute and build a custom schema dictionary as shown below,
myschema = dict(map(lambda x: (x.name, x.dataType), df.schema.fields))
print(myschema.get('some_field_002'))
Output:
StringType
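For per-field access in the style of the df.schema.getType(...) the question sketches, the StructType returned by df.schema also supports lookup by field name, which returns the StructField directly. A short sketch using a field name from the question:
field = df.schema["SOME_FIELD_001"]  # StructType lookup by name returns a StructField
print(field.dataType)                # e.g. StringType
print(field.nullable)                # e.g. True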
Upvotes: 0
Reputation: 8410
If name is set to df, the resulting metadata dict will be attached to the dataframe as df.meta:
name = df  # enter name of dataframe here

def metadata(name):  # function for getting metadata in a dict
    null = [str(n.nullable) for n in name.schema.fields]   # nullability
    types = [str(i.dataType) for i in name.schema.fields]  # type
    both = [list(a) for a in zip(types, null)]              # combine type + nullability
    names = name.columns                                    # names of columns
    final = {}                                              # create dict
    for key in names:
        for value in both:
            final[key] = value
            both.remove(value)
            break
    return final

name.meta = metadata(name)  # final dict is called df.meta
# if name=df2, final dict will be df2.meta
Input: df.meta
Output: {'col1': ['StringType', 'True'],
'col2': ['StringType', 'True'],
'col3': ['LongType', 'True'],
'col4': ['StringType', 'True']}
#get column info
Input: df.meta['col1']
Output: ['StringType', 'True']
#compare column type + nullability
Input: df.meta['col1'] == df2.meta['col1']
Output: True/False
#compare only column type
Input: df.meta['col1'][0] == df2.meta['col1'][0]
Output: True/False
#compare only nullability
Input: df.meta['col1'][1] == df2.meta['col1'][1]
Output: True/False
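To compare all columns at once rather than one at a time, a small sketch building on the df.meta and df2.meta dicts above (assuming both dataframes exist and have been run through the metadata function):
# Columns present in both dataframes whose type or nullability differ
mismatches = {col: (df.meta[col], df2.meta[col])
              for col in df.meta
              if col in df2.meta and df.meta[col] != df2.meta[col]}

# Columns present in only one of the dataframes
only_in_df = set(df.meta) - set(df2.meta)
only_in_df2 = set(df2.meta) - set(df.meta)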
Upvotes: 1