Reputation: 79
I need to return boolean false if my input dataframe has duplicate columns with the same name. I wrote the below code. It identifies the duplicate columns from the input dataframe and returns the duplicated columns as a list. But when i call this function it must return boolean value i.e., if my input dataframe has duplicate columns with the same name it must return flase.
@udf('string')
def get_duplicates_cols(df, df_cols):
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
for i in duplicate_col_index:
df_cols[i] = df_cols[i] + '_duplicated'
df2 = df.toDF(*df_cols)
cols_to_remove = [c for c in df_cols if '_duplicated' in c]
return cols_to_remove
duplicate_cols = udf(get_duplicates_cols,BooleanType())
Upvotes: 0
Views: 371
Reputation: 1375
You don't need any UDF, you simple need a Python function. The check will be in Python not in JVM. So, as @Santiago P said you can use checkDuplicate ONLY
def checkDuplicate(df):
return len(set(df.columns)) == len(df.columns)
Upvotes: 2
Reputation: 101
Assuming that you pass the data frame to the function.
udf(returnType=BooleanType())
def checkDuplicate(df):
return len(set(df.columns)) == len(df.columns)
Upvotes: 0