simonzack

Reputation: 20928

Remove similar duplicates from dataframe

How would I remove rows in a dataframe whose values are within a certain threshold of each other?

                x             y
1   -0.111111e-15  0.111111e-15
2   -1.111112e-15  1.111112e-15
3   -1.111111e-15  1.111111e-15

For example, if I set the threshold to 1e-8, either the second or the third row should be removed.

Upvotes: 2

Views: 99

Answers (3)

Steven Beaupré

Reputation: 21621

A similar approach using dplyr that works on both a data.table and a data.frame:

dfrm<-data.frame(id=letters[1:3],x=c(-1/9/1e15,-1/9/1e14,-1/9/1e14),
                 y=c(1/9/1e15,1/9/1e14,1/9/1e14))

library(dplyr)
dfrm %>%
  # select only numeric columns
  select(which(sapply(., is.numeric))) %>%
  # keep the first row of each group that is identical after rounding to 8 decimal places
  slice(which(!duplicated(round(., 8)))) %>%
  # right join the result with original dataset (get back unselected non-numeric columns)
  right_join(dfrm, .)

Upvotes: 2

MichaelChirico

Reputation: 34703

Here's a possible data.table solution (if your data is currently a data.frame df, just set dt <- data.table(df)).

A more complicated version of your data, with a non-numeric column:

library(data.table)
dt <- data.table(id=letters[1:3],
                 x=c(-1/9/1e15,-1/9/1e14,-1/9/1e14),
                 y=c(1/9/1e15,1/9/1e14,1/9/1e14))

Now we just round all the numeric columns to your threshold and find unique rows:

indx <- names(dt)[sapply(dt, is.numeric)]  ## Find numeric columns
unique(dt[, lapply(.SD, round, 8), .SDcols = indx])
#    x y
# 1: 0 0

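Alternatively, you can keep both the numeric and non-numeric columns while checking for duplicates only on the numeric columns (note that := overwrites the numeric columns of dt in place with their rounded values):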

unique(dt[, (indx) := lapply(.SD, round, 8), .SDcols = indx], by = indx)
#    id x y
# 1:  a 0 0

Upvotes: 4

IRTFM

Reputation: 263301

I read in the console output with a little utility function, rd.txt:

> dat <- rd.txt("                x             y
+ 1   -0.111111e-15  0.111111e-15
+ 2   -1.111112e-15  1.111112e-15
+ 3   -1.111111e-15  1.111111e-15"
+ )
> dat[ ! duplicated( round(dat, 8) ),]
             x           y
1 -1.11111e-16 1.11111e-16

(My first version, with a minus sign rather than a negation operator, was not correct.) This would need some modifications if not all of the columns were numeric. If that's the case, then please post a proper test example, preferably with dput() output rather than console output, which is often ambiguous.

With the example from the other respondent (modified to deliver the requested object class):

dfrm<-data.frame(id=letters[1:3],x=c(-1/9/1e15,-1/9/1e14,-1/9/1e14),
               y=c(1/9/1e15,1/9/1e14,1/9/1e14))
dfrm[ ! duplicated( round( dfrm[ , sapply(dfrm, is.numeric)],8)), ]
  id             x            y
1  a -1.111111e-16 1.111111e-16
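
The same idea can be wrapped in a small function so that the number of digits is a parameter. This is only a sketch in base R: the name drop_near_duplicates and the edge data frame are made up for illustration and are not part of any package.

```r
# Hypothetical helper (name chosen for illustration): keep the first row of
# each group of rows whose numeric columns agree after rounding to `digits`
# decimal places. Non-numeric columns are carried along but not compared.
drop_near_duplicates <- function(df, digits = 8) {
  num_cols <- vapply(df, is.numeric, logical(1))
  if (!any(num_cols)) return(df)                 # nothing to compare on
  rounded <- round(df[, num_cols, drop = FALSE], digits)
  df[!duplicated(rounded), , drop = FALSE]       # original values are kept
}

dfrm <- data.frame(id = letters[1:3],
                   x = c(-1/9/1e15, -1/9/1e14, -1/9/1e14),
                   y = c( 1/9/1e15,  1/9/1e14,  1/9/1e14))

res <- drop_near_duplicates(dfrm, digits = 8)
print(res)   # a single row remains (id "a"), as in the output above

# Caveat: rounding sorts values into bins, so two values that are closer
# together than 1e-8 can still fall on opposite sides of a bin boundary.
edge <- data.frame(x = c(0.49e-8, 0.51e-8))
print(nrow(drop_near_duplicates(edge, digits = 8)))   # both rows are kept
```

Because rounding is a binning rule rather than a pairwise distance check, it is a convenient approximation of "within a threshold", not an exact test.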

Upvotes: 5
