Reputation: 2112
I have a UDF that generates a column of UUIDs. My large DataFrame (dfFull) uses the UDF to define that column (called id), and then a smaller subset of the data is split off into df3. I need dfFull and df3 to have the same IDs, yet when the code runs they end up with completely different values. Here is the code:
    import uuid

    from pyspark.sql import functions as sql_functions
    from pyspark.sql.types import StringType

    # UDF that returns a new random UUID string on every call
    uuidUdf = sql_functions.udf(lambda: str(uuid.uuid4()), StringType())

    def mail_file_reformat(df, filename):
        dfFull = df.withColumn("id", uuidUdf())
        dfFull = dfFull.drop_duplicates()
        # df3 shares dfFull's lazy plan, including the uuidUdf call
        df3 = dfFull
        df3 = df3.select([list of columns here])
        df3 = df3.drop_duplicates()
        df3 = df3.drop_duplicates(['im_bar_number'])
        dfFull = dfFull.select([Reordered list here])
        return df3, dfFull
I think it's because uuidUdf is being called separately for each DataFrame, but I am struggling to think of a way to get these DataFrames to have the same IDs.
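To illustrate that suspicion, here is a plain-Python analogy (not Spark code) of a lazily evaluated plan that calls uuid4 every time it is run. The names lazy_with_ids, plan, full_view and subset_view are made up for this sketch; it only shows why evaluating the same recipe twice yields different IDs, while evaluating once and reusing the result does not.

```python
import uuid


def lazy_with_ids(rows):
    """Return a zero-argument 'plan' that assigns fresh UUIDs each time it runs."""
    return lambda: [(row, str(uuid.uuid4())) for row in rows]


plan = lazy_with_ids(["a", "b", "c"])

# Running the plan twice is like triggering two separate actions:
# uuid4 is called again, so the ids differ between runs.
first_run = plan()
second_run = plan()

# Evaluating once and reusing the result keeps the ids consistent.
materialized = plan()
full_view = materialized
subset_view = materialized[:2]
```

In this analogy, first_run and second_run play the roles of dfFull and df3 when each is computed by its own action.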
Upvotes: 1
Views: 174
Reputation: 2112
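Turns out I just needed to cache dfFull before deriving df3 from it. Spark evaluates DataFrames lazily, so without the cache each action on dfFull or df3 recomputes the plan and calls uuidUdf again, which produces new UUIDs. Keep in mind that cache() is best-effort: a cached partition that gets evicted is recomputed and would receive new IDs. Here is the updated function: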
    def mail_file_reformat(df, filename):
        dfFull = df.withColumn("id", uuidUdf())
        dfFull = dfFull.drop_duplicates()
        # Cache so later actions reuse the generated ids instead of re-running uuidUdf
        dfFull.cache()
        df3 = dfFull
        df3 = df3.select([list of columns here])
        df3 = df3.drop_duplicates()
        df3 = df3.drop_duplicates(['im_bar_number'])
        dfFull = dfFull.select([Reordered list here])
        return df3, dfFull
Upvotes: 1