Reputation: 7421
The dropDuplicates method of Spark DataFrames is not working as I expect, and I think it is because the index column, which was part of my dataset, is being treated as a column of data. There definitely are duplicates in there: I checked by comparing COUNT()
and COUNT(DISTINCT())
on all the columns except the index. I'm new to Spark DataFrames, but if I were using Pandas, at this point I would call pandas.DataFrame.set_index
on that column.
Does anyone know how to handle this situation?
Secondly, there appear to be two methods on a Spark DataFrame, drop_duplicates
and dropDuplicates
. Are they the same?
Upvotes: 1
Views: 1211
Reputation: 1569
If you don't want the index column to be considered when checking for distinct records, you can drop that column with the command below, or select only the columns you need.
df = df.drop('p_index')        # pass the column name to be dropped
df = df.select('name', 'age')  # pass only the required columns
drop_duplicates() is an alias for dropDuplicates().
Upvotes: 2