gruvn
gruvn

Reputation: 744

Finding rows with a unique combination of values (R)

This is a bit more complicated that the title lets on, and I'm sure if I could think of a way to better describe it, I could google it better.

I have data that looks like this:

SET                     ID    
100301006              1287025
100301006              1287026
100301010              1287027
100301013              1287030
100301011              1287027

and I would like to identify and select those rows where each both values in a row have a unique value for the column. In the example above, I want to grab only the row:

100301013              1287030

I don't want SET 100301006, since it matches to 2 different records in the ID field (1287025 and 1287026). Similarly, I don't want SET 100301010 since the ID record it matches to (1287027) can also match another SET (10030011).

In some cases there could be more than 2 matches.

I could do this in loops, but that seems like a hack. I'd love a base R or data.table solution, but I'm not so interested in dplyr (trying to minimize dependencies).

Upvotes: 4

Views: 941

Answers (5)

Anes kz
Anes kz

Reputation: 1

If we have a dataframe df and want to find unique values of columns: column1, column2, column3:

library(dplyr)
df <- df %>% group_by(column1,column2,column3) %>% summarise()

Upvotes: 0

IceCreamToucan
IceCreamToucan

Reputation: 28685

You could use data.table to select only groups with 1 row, grouping by ID first, then by SET. This is similar to @r2evans method of checking that the counts for ID and SET are both 1.

library(data.table)
setDT(df)

df[, if(.N == 1) .SD, ID][, if(.N == 1) .SD, SET]

#          SET      ID
# 1: 100301013 1287030

Or for more than 2 columns

Reduce(function(x, y) x[, if(.N == 1) .SD, y], names(df), init = df)
#         ID       SET
# 1: 1287030 100301013

Upvotes: 2

akrun
akrun

Reputation: 887108

We can use duplicated on each columns independently to create a list of logical vectors, Reduce it to a single vector with & and use that to subset the rows of the dataset

df1[Reduce(`&`, lapply(df1, function(x) 
         !(duplicated(x)|duplicated(x, fromLast = TRUE)))),]
#     SET      ID
#4 100301013 1287030

Or as @chinsoon12 suggested

 m1 <- sapply(df1, function(x) !(duplicated(x)| duplicated(x, fromLast = TRUE)))
 df1[rowSums(m1) == ncol(m1),, drop = FALSE]

data

df1 <- structure(list(SET = c(100301006L, 100301006L, 100301010L, 100301013L, 
100301011L), ID = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L
)), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 4

ThomasIsCoding
ThomasIsCoding

Reputation: 101335

With base R, maybe you can use ave() to make it:

r <-df[which(with(df,ave(seq(nrow(df)),SET,FUN = length)*ave(seq(nrow(df)),ID,FUN = length)) == 1),]
> r
        SET      ID
4 100301013 1287030

DATA

df <- read.table(text="SET                     ID    
100301006              1287025
100301006              1287026
100301010              1287027
100301013              1287030
100301011              1287027",header = T)

Upvotes: 1

r2evans
r2evans

Reputation: 160437

Here's a quick base-R hack:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
SET                     ID    
100301006              1287025
100301006              1287026
100301010              1287027
100301013              1287030
100301011              1287027")

counts <- sapply(df, function(x) { tb <- table(x); tb[ match(x, names(tb)) ]; })
counts
#           SET ID
# 100301006   2  1
# 100301006   2  1
# 100301010   1  2
# 100301013   1  1
# 100301011   1  2

At this point, we have the number of times each element is found in its column ... so we want rows where all counts are 1.

df[ rowSums(counts == 1) == ncol(df), ]
#         SET      ID
# 4 100301013 1287030

Upvotes: 3

Related Questions