Reputation: 2876
I'm going to ingest data using a Databricks notebook. I want to validate the schema of the ingested data against the schema I expect it to have.
So basically I have:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

validation_schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", IntegerType(), False),
    StructField("c", StringType(), False),
    StructField("d", StringType(), False)
])
data_ingested_good = [("foo",1,"blabla","36636"),
("foo",2,"booboo","40288"),
("bar",3,"fafa","42114"),
("bar",4,"jojo","39192"),
("baz",5,"jiji","32432")
]
data_ingested_bad = [("foo","1","blabla","36636"),
("foo","2","booboo","40288"),
("bar","3","fafa","42114"),
("bar","4","jojo","39192"),
("baz","5","jiji","32432")
]
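# Plain lists and StructType have no printSchema(); build DataFrames first.
df_good = spark.createDataFrame(data_ingested_good, ["a", "b", "c", "d"])
df_bad = spark.createDataFrame(data_ingested_bad, ["a", "b", "c", "d"])
df_good.printSchema()
df_bad.printSchema()
print(validation_schema.simpleString())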
I've seen similar questions, but the answers are always in Scala.
Upvotes: 2
Views: 5384
Reputation: 2334
Another method: you can find the difference with a simple Python list comparison.
dept = [("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
]
deptColumns = ["dept_name","dept_id"]
dept1 = [("Finance",10,'999'),
("Marketing",20,'999'),
("Sales",30,'999'),
("IT",40,'999')
]
deptColumns1 = ["dept_name","dept_id","extracol"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema = deptColumns1)
deptDF_columns=deptDF.schema.names
dept1DF_columns=dept1DF.schema.names
list_difference = []
for item in dept1DF_columns:
    if item not in deptDF_columns:
        list_difference.append(item)
print(list_difference)
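The loop above only reports columns that exist in dept1DF but not in deptDF. As a minimal sketch, the same idea can be made two-way so that missing columns are reported as well. It works on plain lists of names such as those returned by df.schema.names; the helper name column_differences is hypothetical and not part of PySpark.

```python
def column_differences(expected_cols, actual_cols):
    """Return (missing, extra): names only in expected, names only in actual."""
    expected_set = set(expected_cols)
    actual_set = set(actual_cols)
    # List comprehensions keep the original column order in the report.
    missing = [c for c in expected_cols if c not in actual_set]
    extra = [c for c in actual_cols if c not in expected_set]
    return missing, extra


# Same column names as in the example above.
deptColumns = ["dept_name", "dept_id"]
deptColumns1 = ["dept_name", "dept_id", "extracol"]

missing, extra = column_differences(deptColumns, deptColumns1)
print("missing:", missing)
print("extra:", extra)
```

This compares names only; two columns with the same name but different types would still be reported as matching.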
Upvotes: 2
Reputation: 87249
It really depends on your exact requirements and on the complexity of the schemas you want to compare: for example, ignoring the nullability flag vs. taking it into account, the order of columns, support for maps/structs/arrays, etc. Also, do you want to see the differences, or just a flag indicating whether the schemas match?
In the simplest case, you can just compare the string representations of the schemas (note that simpleString() does not include the nullability flag):
def compare_schemas(df1, df2):
    return df1.schema.simpleString() == df2.schema.simpleString()
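If you want to see the differences rather than a single True/False, one possible sketch is to compare lists of (column name, type string) pairs, which is the shape that df.dtypes returns. The helper name diff_dtypes is hypothetical, the sample pairs are typed out by hand rather than taken from a real DataFrame, and nullability is ignored because df.dtypes does not carry it.

```python
def diff_dtypes(expected, actual):
    """Compare two lists of (column name, type string) pairs, ignoring column order.

    Returns a dict with columns missing from `actual`, extra columns in
    `actual`, and columns present in both whose type strings differ.
    """
    expected_map = dict(expected)
    actual_map = dict(actual)
    return {
        "missing": [n for n in expected_map if n not in actual_map],
        "extra": [n for n in actual_map if n not in expected_map],
        "type_mismatch": {
            n: (expected_map[n], actual_map[n])
            for n in expected_map
            if n in actual_map and expected_map[n] != actual_map[n]
        },
    }


# With real objects the inputs could be built like this (assumption, not run here):
#   expected = [(f.name, f.dataType.simpleString()) for f in validation_schema.fields]
#   actual = df.dtypes
expected = [("a", "string"), ("b", "int"), ("c", "string"), ("d", "string")]
actual_bad = [("a", "string"), ("b", "string"), ("c", "string"), ("d", "string")]

report = diff_dtypes(expected, actual_bad)
print(report)
```

Keep in mind that when Spark infers a schema from Python data, Python integers become bigint, so a column expected to be int will show up as a mismatch unless the data is loaded with an explicit schema.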
Personally, I would recommend using an existing library such as Chispa, which has more advanced schema comparison functions: you can tune the checks, it will show the differences, etc. After installation (you can just run %pip install chispa), this will throw an exception if the schemas are different:
from chispa.schema_comparer import assert_schema_equality
assert_schema_equality(df1.schema, df2.schema)
Upvotes: 4