Spark DataFrame equivalent of pandas.DataFrame.set_index / drop_duplicates vs. dropDuplicates

Question

The drop duplicates method of my Spark DataFrame is not working, and I think it is because the index column that was part of my dataset is being treated as a column of data. There definitely are duplicates in there: I checked by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames, but if I were using pandas, at this point I would call pandas.DataFrame.set_index on that column.

Does anyone know how to handle this situation?

Secondly, there appear to be two methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?

Munesh · Accepted Answer

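Spark DataFrames have no index the way pandas does, so every column, including your index column, is compared when looking for duplicates. If you don't want the index column to be considered when checking for distinct records, either drop that column or select only the columns you need, using the commands below.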

df = df.drop('p_index')  # Pass the name of the column to drop

df = df.select('name', 'age')  # Pass the columns you need

drop_duplicates() is an alias for dropDuplicates().

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
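If you would rather keep the index column in the result, dropDuplicates also accepts a list of column names to compare on. Below is a minimal sketch of that approach, not a definitive implementation. The helper names columns_except and drop_duplicates_ignoring are hypothetical, the column names p_index, name and age are taken from the commands above, and the sample rows in demo() are made up for illustration.

```python
def columns_except(df, ignored):
    """Return df's column names, in their original order, minus the ignored ones."""
    ignored = set(ignored)
    return [c for c in df.columns if c not in ignored]


def drop_duplicates_ignoring(df, ignored):
    """Drop rows that are duplicates on every column except those in `ignored`.

    Unlike df.drop(...).dropDuplicates(), the ignored columns stay in the result.
    """
    return df.dropDuplicates(columns_except(df, ignored))


def demo():
    # Needs pyspark and a local Spark runtime. The import is kept inside the
    # function so the two helpers above do not depend on it being installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("dedupe-demo").getOrCreate()
    )
    # Illustrative data: rows 0 and 1 differ only in p_index.
    df = spark.createDataFrame(
        [(0, "Ann", 31), (1, "Ann", 31), (2, "Bob", 45)],
        ["p_index", "name", "age"],
    )
    deduped = drop_duplicates_ignoring(df, ["p_index"])
    # Two rows remain; which p_index is kept for the duplicated pair is not guaranteed.
    deduped.show()
    spark.stop()
```

Call demo() to try it against a local Spark session. Spark keeps an arbitrary row from each group of duplicates, so if you need a particular one (say, the lowest p_index), you would have to choose it explicitly, for example with a group-by and an aggregate.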
