The Great

Reputation: 7713

How to identify columns based on datatype and convert them in PySpark?

I have a dataframe like the one shown below:

import pandas as pd
from pyspark.sql.types import StringType, TimestampType

df = pd.DataFrame({
    'date': ['11/12/2001', '11/12/2002', '11/12/2003', '11/12/2004', '11/12/2005', '11/12/2006'],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', 'READ_4', 'READ_5'],
    'val_date': ['21/12/2001', '22/12/2002', '23/12/2003', '24/12/2004', '25/12/2005', '26/12/2006'],
})
spark_df = spark.createDataFrame(df)
spark_df = spark_df.withColumn("date", spark_df["date"].cast(TimestampType()))
spark_df = spark_df.withColumn("val_date", spark_df["val_date"].cast(TimestampType()))

[image: schema of spark_df showing date: timestamp, readings: string, val_date: timestamp]

I have a dataframe with column datatypes as shown above

What I would like to do is identify

a) columns whose names contain the terms "date" or "time", and convert their datatype from Timestamp/Datetime to string

and

b) columns whose datatype is Timestamp or Datetime, and convert them to string type

Though the below works, it isn't elegant or efficient. I have more than 3,000 columns and cannot do this line by line:

spark_df = spark_df.withColumn("date", spark_df["date"].cast(StringType()))
spark_df = spark_df.withColumn("val_date", spark_df["val_date"].cast(StringType()))

I also tried the below, but it doesn't help:

selected = [c.cast(StringType()) for c in spark_df.columns if ('date') in c]+['time']
spark_df.select(selected)

Is there any way to identify columns based on conditions (a) and (b) above and convert them all at once?

Any inputs to solve this problem with at least one of these approaches would be helpful.

Upvotes: 4

Views: 1805

Answers (1)

pissall

Reputation: 7399

You can do the following:

from pyspark.sql.functions import col

# Map each column name to its datatype string, e.g. {'date': 'timestamp', ...}
schema = {name: dtype for name, dtype in spark_df.dtypes}

# Condition (b): datatype is timestamp or date;
# condition (a): column name contains "date" or "time"
time_cols = [name for name, dtype in schema.items()
             if dtype in ("timestamp", "date") or "date" in name or "time" in name]

# Cast every matched column to string
for column in time_cols:
    spark_df = spark_df.withColumn(column, col(column).cast("string"))
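With thousands of columns, chaining withColumn calls adds one projection per column to the query plan. As a minimal sketch of an alternative (not part of the original answer; it reuses the time_cols list computed above), all the casts can be applied in a single select:

from pyspark.sql.functions import col

# Cast matched columns to string, pass all other columns through unchanged
exprs = [col(name).cast("string") if name in time_cols else col(name)
         for name in spark_df.columns]
spark_df = spark_df.select(*exprs)

cast preserves the column name, so no alias is needed and the column order stays the same.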

Upvotes: 5
