R - Filter duplicate rows in large data frame

Question

I have a data frame with 500k rows and about 130 columns. I want to remove rows that are duplicated across all columns except one (column 128). I tried:

df <- unique(df[,-128])

df <- df[!duplicated(df[, -128]), ]

df <- distinct(df, -column128)

I get the same error over and over again:

Error in paste(...............,  : formal argument "sep" matched by multiple actual arguments

I also tried typing out every single column name, but got the same error. If I run the above on the first 9 columns only, the error doesn't appear. However, if I run it on the first 10 columns, I get the error. Is there a limit on the number of columns when removing duplicated rows? Or does anyone have a solution?

The df looks as follows (column 128 = label):

'data.frame':    571262 obs. of  139 variables:
 $ x                      : num  1 1 1 1 0 0 0 7 7 7 ...
 $ jan                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ feb                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mrt                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mei                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jun                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jul                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ aug                    : num  1 1 0 0 0 0 0 0 0 0 ...
 $ sep                    : num  0 0 1 1 0 0 0 0 0 0 ...
 $ okt                    : num  0 0 0 0 1 1 1 0 0 0 ...
 $ nov                    : num  0 0 0 0 0 0 0 1 1 1 ...
 $ dec                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ - 1                    : num  0 0 1 1 1 ...
 $ - 2                    : num  0 0 0 0 1 ...
 $ - 3                    : num  0 0 0 0 0 ...
 $ - 4                    : num  0 0 0 0 0 0 0 0 0 0 ...
......
 $ - 114                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ label                  : int  8 12 8 12 8 10 12 8 10 12 ...
 $ 2008                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2009                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2010                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2011                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2012                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2013                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2014                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2015                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2016                   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ 2017                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 2018                   : num  0 0 0 0 0 0 0 0 0 0 ...

smci · Accepted Answer

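It looks like your month column 'sep' is colliding with the sep argument of paste(..., sep). The error says exactly that: 'formal argument "sep" matched by multiple actual arguments'. This also fits what you observed: 'sep' is the 10th column, so the first 9 columns work and adding the 10th triggers the error.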
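A minimal sketch of that collision, assuming (as the error message suggests) that the columns end up as named arguments to a paste() call that also sets sep itself. The two-column data frame `cols` is a made-up stand-in for the real data:

```r
# Toy columns; one is named `sep`, like the September column in the question
cols <- data.frame(aug = c(1, 0), sep = c(0, 1))

# Splice the columns into a paste() call that also passes sep explicitly.
# paste() then receives two arguments named `sep` and refuses to run.
msg <- tryCatch(
  do.call(paste, c(cols, sep = "\r")),
  error = function(e) conditionMessage(e)
)

# msg holds the error text instead of a pasted result
print(msg)
```

With any other name for the second column, the same call returns one pasted string per row instead of failing.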

It's unlikely that you have two or more columns called 'sep', but you can check with which(names(df) == 'sep').

The workaround is to rename your 'sep' column to something else, e.g. 'spt'.
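A minimal sketch of the rename followed by the de-duplication, using a made-up three-column data frame in place of the real one. The new name 'spt' and the helper variable `key_cols` are only illustrative, and the label column is excluded by name rather than by position 128:

```r
# Toy stand-in for the real data frame
df <- data.frame(
  aug   = c(1, 1, 0, 0),
  sep   = c(0, 0, 1, 1),
  label = c(8L, 12L, 8L, 12L)
)

# Rename the column that clashes with paste()'s `sep` argument
names(df)[names(df) == "sep"] <- "spt"

# Compare rows on every column except `label`
key_cols <- setdiff(names(df), "label")

# Keep the first row of each group of duplicates (note the trailing comma:
# the logical vector selects rows, not columns)
df_unique <- df[!duplicated(df[, key_cols, drop = FALSE]), ]

print(df_unique)
```

Note that this keeps the label of whichever duplicate comes first; if the labels of duplicated rows differ and that matters, they need to be handled separately.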
