PySpark UDF generating different values for a column despite only being called once

Question

I have a UDF that generates a column of UUIDs. My large DataFrame (dfFull) uses the UDF to define that column (called id), and then a smaller subset of the data is split off into df3. I need dfFull and df3 to have the same IDs, yet when the code runs they have completely different values. Here is the code:

import uuid

from pyspark.sql import functions as sql_functions
from pyspark.sql.types import StringType

# Generates a random UUID string for each row
uuidUdf = sql_functions.udf(lambda: str(uuid.uuid4()), StringType())

def mail_file_reformat(df, filename):
    dfFull = df.withColumn("id", uuidUdf())

    dfFull = dfFull.drop_duplicates()
    
    df3 = dfFull

    df3 = df3.select([list of columns here])

    df3 = df3.drop_duplicates()
    df3 = df3.drop_duplicates(['im_bar_number'])

    dfFull = dfFull.select([Reordered list here])

    return df3, dfFull

I think it's because uuidUdf is being called separately for each DataFrame, but I am struggling to think of a way to get these DataFrames to have the same IDs.

DBA108642 · Accepted Answer

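It turns out I just needed to cache dfFull before deriving df3 from it. Spark evaluates DataFrames lazily, so without the cache the UDF is re-run each time dfFull or df3 is computed, producing fresh UUIDs every time: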

def mail_file_reformat(df, filename):
    dfFull = df.withColumn("id", uuidUdf())

    dfFull = dfFull.drop_duplicates()
    dfFull.cache()  # mark dfFull for caching so df3 reuses the same generated ids
    df3 = dfFull

    df3 = df3.select([list of columns here])

    df3 = df3.drop_duplicates()
    df3 = df3.drop_duplicates(['im_bar_number'])

    dfFull = dfFull.select([Reordered list here])

    return df3, dfFull
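
To illustrate why caching makes the IDs line up, here is a minimal sketch in plain Python. It is a toy model only: the LazyIds class and its cache() and collect() methods are hypothetical stand-ins I made up for illustration, not Spark's API, and it assumes nothing beyond the standard library.

```python
import uuid


class LazyIds:
    """Toy stand-in for a lazily evaluated column of UUIDs (not Spark's API)."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self._cache_requested = False
        self._cached = None  # filled by the first collect() after cache()

    def cache(self):
        # Only marks the data for caching; nothing is computed yet.
        self._cache_requested = True
        return self

    def collect(self):
        # An "action": without a cache, the UUIDs are regenerated on every call.
        if self._cached is not None:
            return self._cached
        rows = [str(uuid.uuid4()) for _ in range(self.n_rows)]
        if self._cache_requested:
            self._cached = rows
        return rows


# Without caching, two evaluations of the same definition give different IDs.
uncached = LazyIds(3)
first, second = uncached.collect(), uncached.collect()
uncached_equal = first == second

# With caching, the second evaluation reuses the stored IDs.
cached = LazyIds(3).cache()
third, fourth = cached.collect(), cached.collect()
cached_equal = third == fourth

print(uncached_equal, cached_equal)
```

In real Spark the cache is, as far as I understand it, best effort rather than a guarantee: if cached partitions are lost (for example when an executor fails), they are recomputed and the UDF would produce new UUIDs. If the IDs must be stable, writing dfFull out and reading it back, or checkpointing it, is likely the safer route.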
