Reputation: 2954
I'm messing around with DataFrames in PySpark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why, as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
Upvotes: 29
Views: 172834
Reputation: 1869
In summary, the distinct() and dropDuplicates() methods both remove duplicate rows, with one essential difference: dropDuplicates() can consider only a subset of the columns.
from pyspark.sql.functions import col, count

data = [("James","","Smith","36636","M",60000),
        ("James","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Maria","Mary","Brown","","F",0)]
columns = ["first_name","middle_name","last_name","dob","gender","salary"]

df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

# show which first names occur more than once
df.groupBy('first_name').agg(
    count('first_name').alias("count_duplicates")).filter(
    col('count_duplicates') >= 2).show()

# keep a single row per first_name
df.dropDuplicates(['first_name']).show()
# output
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob |gender|salary|
+----------+-----------+---------+-----+------+------+
|James | |Smith |36636|M |60000 |
|James |Rose | |40288|M |70000 |
|Robert | |Williams |42114| |400000|
|Maria |Anne |Jones |39192|F |500000|
|Maria |Mary |Brown | |F |0 |
+----------+-----------+---------+-----+------+------+
+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
| James| 2|
| Maria| 2|
+----------+----------------+
+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name| dob|gender|salary|
+----------+-----------+---------+-----+------+------+
| James| | Smith|36636| M| 60000|
| Maria| Anne| Jones|39192| F|500000|
| Robert| | Williams|42114| |400000|
+----------+-----------+---------+-----+------+------+
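For contrast, here is a minimal sketch (reusing the df built above) of the difference: distinct() compares entire rows, so with this sample data nothing is removed, while dropDuplicates(['first_name']) collapses the data to one row per first name.
# distinct() compares whole rows; every row in the sample data is
# unique as a whole, so all five rows survive
df.distinct().show()
# dropDuplicates() with no argument behaves exactly like distinct()
df.dropDuplicates().show()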
Upvotes: 2
Reputation: 5055
If you have a DataFrame and want to remove all rows that are duplicated in a specific column (called 'colName'):
Count before the de-dupe:
df.count()
Do the de-dupe (convert the column you are de-duping to string type first):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
You can use a sorted groupBy to check that the duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
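An alternative check that stays in Spark rather than going through pandas (a sketch using the col import from above): any group whose count is still above 1 means duplicates remain.
# pure-Spark variant of the same check; an empty result means
# no duplicates are left in 'colName'
df.groupBy('colName').count().filter(col('count') > 1).show()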
Upvotes: 25
Reputation: 330063
It is not an import problem. You simply call .dropDuplicates()
on a wrong object. While class of sqlContext.createDataFrame(rdd1, ...)
is pyspark.sql.dataframe.DataFrame
, after you apply .collect()
it is a plain Python list
, and lists don't provide dropDuplicates
method. What you want is something like this:
df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())

df1.collect()
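Putting it together with the loading code from the question, a minimal end-to-end sketch (assuming sc and sqlContext exist as in the question, and keeping its placeholder path): the point is to stay with the RDD/DataFrame API and call collect() only once you are actually done transforming.
# keep the parsed rows as an RDD; no collect() here
rdd1 = sc.textFile("C:\myfilename.csv") \
    .map(lambda line: tuple(line.split(",")[:4]))

df1 = (sqlContext
       .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
       .dropDuplicates())

df1.show()             # inspect the deduplicated DataFrame
rows = df1.collect()   # only now materialize it as a Python list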
Upvotes: 49