I am trying to apply the PySpark SQL hash function to every row of two DataFrames in order to identify the differences between them. The hash is computed over strings, so I am converting every column of any datatype other than string to string first. Most of my issues are with the date columns: the date format needs to be normalized before converting to string so that the hash-based matching is consistent. Please help me with the approach.
#Identify the fields which are not strings
from pyspark.sql.types import *
from pyspark.sql.functions import col

fields = df_db1.schema.fields
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
#Convert the date fields to a specific date format, then to string.
DateFields = map(lambda f: col(f.name), filter(lambda f: isinstance(f.dataType, DateType), fields))
#Convert all other non-string fields to string.
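For the date fields specifically, one way to make the format consistent before hashing is date_format, which renders a date column as a string in an explicit pattern. A minimal sketch of that step (the "yyyy-MM-dd" pattern is an arbitrary choice):

from pyspark.sql.functions import col, date_format
from pyspark.sql.types import DateType

#Render every DateType column as a string in one fixed pattern,
#so both DataFrames produce identical strings for equal dates.
dateFieldsAsString = [
    date_format(col(f.name), "yyyy-MM-dd").alias(f.name)
    for f in df_db1.schema.fields
    if isinstance(f.dataType, DateType)
]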
For numeric and date fields you can use cast:
# filter out the date fields
DateFields = filter(lambda f: isinstance(f.dataType, DateType), fields)
# cast to string
dateFieldsWithCast = map(lambda f: col(f.name).cast("string").alias(f.name), DateFields)
In an analogous way you can build lists of columns of LongType, etc., and then do a select,
as in this answer
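Putting it together, here is a sketch of building the full column list and hashing each row. The helper names normalize_for_hash and with_row_hash, the "||" separator, and the row_hash column name are illustrative choices; date_format, concat_ws and sha2 are standard pyspark.sql.functions:

from pyspark.sql.functions import col, concat_ws, date_format, sha2
from pyspark.sql.types import DateType, StringType, TimestampType

def normalize_for_hash(df):
    #Render every column as a string so the hash input is type-independent.
    cols = []
    for f in df.schema.fields:
        if isinstance(f.dataType, (DateType, TimestampType)):
            #A fixed pattern keeps date rendering consistent across DataFrames.
            cols.append(date_format(col(f.name), "yyyy-MM-dd HH:mm:ss").alias(f.name))
        elif not isinstance(f.dataType, StringType):
            cols.append(col(f.name).cast("string").alias(f.name))
        else:
            cols.append(col(f.name))
    return df.select(cols)

def with_row_hash(df):
    normalized = normalize_for_hash(df)
    #Concatenate all (now string) columns and hash the result.
    return normalized.withColumn("row_hash", sha2(concat_ws("||", *normalized.columns), 256))

#df_db2 is assumed to be the second DataFrame being compared.
diff = with_row_hash(df_db1).select("row_hash").subtract(with_row_hash(df_db2).select("row_hash"))

Note that concat_ws skips nulls, so two rows that differ only in which column is null can produce the same concatenated string; coalescing nulls to a sentinel value before concatenating avoids that.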