Reputation: 326
I am trying to remove duplicates from a data-frame, but the first occurrence of each row should not be removed. All duplicates other than the first occurrence should be stored in a separate data-frame.
For example, if the data-frame is:
col1,col2,col3,col4
r,t,s,t
a,b,c,d
b,m,c,d
a,b,c,d
a,b,c,d
g,n,d,f
e,f,g,h
t,y,u,o
e,f,g,h
e,f,g,h
In that case I should get two data-frames.
df1:
r,t,s,t
a,b,c,d
b,m,c,d
g,n,d,f
e,f,g,h
t,y,u,o
and the other data-frame should be:
a,b,c,d
a,b,c,d
e,f,g,h
e,f,g,h
Upvotes: 2
Views: 7342
Reputation: 31460
Try using the window function row_number().
Example:
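The answer below assumes df already exists; here is a minimal sketch (not from the original answer) of how the sample data-frame from the question could be built, assuming an active SparkSession named spark:
# hypothetical reconstruction of the question's sample data
data = [
    ("r", "t", "s", "t"), ("a", "b", "c", "d"), ("b", "m", "c", "d"),
    ("a", "b", "c", "d"), ("a", "b", "c", "d"), ("g", "n", "d", "f"),
    ("e", "f", "g", "h"), ("t", "y", "u", "o"), ("e", "f", "g", "h"),
    ("e", "f", "g", "h"),
]
# spark is assumed to be an existing SparkSession
df = spark.createDataFrame(data, ["col1", "col2", "col3", "col4"])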
df.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| r| t| s| t|
#| a| b| c| d|
#| b| m| c| d|
#| a| b| c| d|
#| a| b| c| d|
#| g| n| d| f|
#| e| f| g| h|
#| t| y| u| o|
#| e| f| g| h|
#| e| f| g| h|
#+----+----+----+----+
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, row_number

# Number the rows within each group of identical rows; orderBy(lit(1)) means the
# ordering inside a group is arbitrary, so "first" is simply one row per group.
w = Window.partitionBy("col1", "col2", "col3", "col4").orderBy(lit(1))
# df1: keep exactly one row per distinct combination.
df1 = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
df1.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| b| m| c| d|
#| r| t| s| t|
#| g| n| d| f|
#| t| y| u| o|
#| a| b| c| d|
#| e| f| g| h|
#+----+----+----+----+
# df2: everything beyond the first row in each group, i.e. the extra duplicates.
df2 = df.withColumn("rn", row_number().over(w)).filter(col("rn") > 1).drop("rn")
df2.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| a| b| c| d|
#| a| b| c| d|
#| e| f| g| h|
#| e| f| g| h|
#+----+----+----+----+
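As an alternative sketch (not part of the original answer), assuming Spark 2.4+ where DataFrame.exceptAll is available, the same split can be done without a window:
# Keep one row per distinct combination (the "first" occurrences).
df1 = df.dropDuplicates()
# The remaining copies, with multiplicity preserved, are the extra duplicates.
df2 = df.exceptAll(df1)
As with the row_number approach, which copy counts as the "first" one is not deterministic unless you add an explicit ordering column.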
Upvotes: 4