Ravali

Reputation: 79

UDF function to check whether my input dataframe has duplicate columns or not using pyspark

I need to return boolean False if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value instead, i.e., if my input dataframe has duplicate columns with the same name, it must return False.

@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
      df_cols[i] = df_cols[i] + '_duplicated'
      df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove
duplicate_cols = udf(get_duplicates_cols,BooleanType())

Upvotes: 0

Views: 371

Answers (2)

ggeop

Reputation: 1375

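You don't need a UDF here, you simply need a plain Python function. A UDF is applied to column values, whereas the column names are already available as a Python list on the driver, so the check runs in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate on its own, without the UDF decorator: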

    def checkDuplicate(df):
        return len(set(df.columns)) == len(df.columns) 
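
If you also want to see which names are repeated, as the code in the question tried to do, something along these lines might help. This is only a minimal sketch: find_duplicate_columns and has_unique_columns are hypothetical helper names, not part of PySpark, and they take a plain list of names (what df.columns gives you), so you can try them without a Spark session.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


def has_unique_columns(columns):
    """Return False if any column name is repeated, True otherwise."""
    return not find_duplicate_columns(columns)


# With a real DataFrame you would pass df.columns here (assumed usage).
cols = ["id", "name", "id", "age"]
print(find_duplicate_columns(cols))  # ['id']
print(has_unique_columns(cols))      # False
```

The comparison is an exact string match, so names that differ only in case are treated as different columns here.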

Upvotes: 2

Santiago P

Reputation: 101

Assuming that you pass the data frame to the function.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    @udf(returnType=BooleanType())
    def checkDuplicate(df):
        return len(set(df.columns)) == len(df.columns)

Upvotes: 0
