chetan
chetan

Reputation: 125

Pyspark dataframe data type for all columns changed to String by UDF

I have a dataframe which have few columns like email_address(String), paid(integer) and date(datetime type)

I'm running a UDF as mentioned below:

from pyspark.sql.functions import udf, col

def conv(column):
    date_format='%m/%d/%Y'
    a = None
    if column:
        try:
            a= datetime.strptime(str(column),'%Y-%m-%d').strftime(date_format)
            print("Inside Try")
        except:
            a = column
            print("Inside except")
    return a

conv_func = udf(conv)

df_new = date_df.select(*(conv_func(col(c)).alias(c) for c in date_df.columns))

So in df_new I'm expecting email_address(String), paid(integer) and date(string) with changed format from YYYY-MM-DD to MM/DD/YYYY .

The problem is UDF does converts the format but also changes the datatype of paid column to string from integer which I didn't anticipated for.

I wonder how to avoid that using this UDF.

Upvotes: 0

Views: 3703

Answers (2)

Suresh
Suresh

Reputation: 5880

To change the format of date columns, you can use date_format from pyspark sql functions. I created example data and tested,

 >>> for pyspark.sql import functions as F 
 >>> l=[('2018-01-22','id1',123,'2018-01-21'),('2018-01-22','id2',234,'2018-01-21'),('2018-01-22','id3',345,'2018-01-21'),('2018-01-22','id2',456,'2018-01-21')]
 >>> df = spark.createDataFrame(l,['date1','id','value','date2'])
 >>> df = df.select(df.date1.cast('date'),'id','value',df.date2.cast('date'))
 >>> df.printSchema()
 root
  |-- date1: date (nullable = true)
  |-- id: string (nullable = true)
  |-- value: long (nullable = true)
  |-- date2: date (nullable = true)

 >>> df.show()
 +----------+---+-----+----------+
 |     date1| id|value|     date2|
 +----------+---+-----+----------+
 |2018-01-22|id1|  123|2018-01-21|
 |2018-01-22|id2|  234|2018-01-21|
 |2018-01-22|id3|  345|2018-01-21|
 |2018-01-22|id2|  456|2018-01-21|
 +----------+---+-----+----------+
 >>> dcols,cols = [],[]
 >>> for x in df.schema.fields:
 ...     if repr(x.dataType) == 'DateType':
 ...        dcols.append(x.name)
 ...     else:
 ...        cols.append(x.name)
 ...
 >>> dcols
 ['date1', 'date2']
 >>> cols
 ['id', 'value']
 >>> df.select([F.date_format(c,'MM/dd/yyy').alias('%s'%c) for c in dcols]+cols).show()
 +----------+----------+---+-----+
 |     date1|     date2| id|value|
 +----------+----------+---+-----+
 |01/22/2018|01/21/2018|id1|  123|
 |01/22/2018|01/21/2018|id2|  234|
 |01/22/2018|01/21/2018|id3|  345|
 |01/22/2018|01/21/2018|id2|  456|
 +----------+----------+---+-----+

 ## If you still want to use UDF

 >>> from datetime import datetime
 >>> def conv(column):
 ...     date_format='%m/%d/%Y'
 ...     a = datetime.strptime(str(column),'%Y-%m-%d').strftime(date_format)
 ...     return a
 ...
 >>> conv_func = F.udf(conv)
 >>> df.select([conv_func(F.col(x)).alias('%s'%x) for x in dcols]+cols).show()
 +----------+----------+---+-----+
 |     date1|     date2| id|value|
 +----------+----------+---+-----+
 |01/22/2018|01/21/2018|id1|  123|
 |01/22/2018|01/21/2018|id2|  234|
 |01/22/2018|01/21/2018|id3|  345|
 |01/22/2018|01/21/2018|id2|  456|
 +----------+----------+---+-----+

Hope this helps.

Upvotes: 1

pault
pault

Reputation: 43544

Here is one way to do this in the way you are trying.

Use a function to test if a column is a date and only apply your conversion on those.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

def conv(column):
    date_format='%m/%d/%Y'
    a = datetime.strptime(str(column),'%Y-%m-%d').strftime(date_format)
    return a

def is_date(column):
    try:
        conv(column)
        return True
    except:
        return False

conv_func = udf(conv)
date_udf = udf(is_date, BooleanType())

df_new = date_df.select(
    *(conv_func(col(c)).alias(c) if date_udf(col(c)) else col(c) for c in date_df.columns)
)

I haven't tested this (pseudo)code (It would have helped if you have provided an MCVE), but I think something like this approach should work.

Upvotes: 1

Related Questions