Reputation: 744
This is a bit more complicated than the title lets on, and I'm sure if I could think of a way to describe it better, I could google it better.
I have data that looks like this:
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027
and I would like to identify and select those rows where both values in a row are unique within their respective columns. In the example above, I want to grab only the row:
100301013 1287030
I don't want SET 100301006, since it matches 2 different records in the ID field (1287025 and 1287026). Similarly, I don't want SET 100301010, since the ID record it matches (1287027) can also match another SET (100301011).
In some cases there could be more than 2 matches.
I could do this in loops, but that seems like a hack. I'd love a base R or data.table solution, but I'm not so interested in dplyr (trying to minimize dependencies).
Upvotes: 4
Views: 941
Reputation: 1
If we have a dataframe df and want to find the unique values of columns column1, column2, and column3:
library(dplyr)
df <- df %>% group_by(column1,column2,column3) %>% summarise()
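Since the question specifically asks to minimize dependencies, the same set of distinct rows can be obtained in base R with unique(); a minimal sketch, using the hypothetical column names above:

```r
# Hypothetical data frame with the column names used above
df <- data.frame(column1 = c(1, 1, 2),
                 column2 = c("a", "a", "b"),
                 column3 = c(TRUE, TRUE, FALSE))

# unique() on the selected columns is equivalent to
# group_by() + summarise() with no summary expressions
out <- unique(df[c("column1", "column2", "column3")])
```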
Upvotes: 0
Reputation: 28685
You could use data.table to select only groups with 1 row, grouping by ID first, then by SET. This is similar to @r2evans method of checking that the counts for ID and SET are both 1.
library(data.table)
setDT(df)
df[, if(.N == 1) .SD, ID][, if(.N == 1) .SD, SET]
# SET ID
# 1: 100301013 1287030
Or for more than 2 columns
Reduce(function(x, y) x[, if(.N == 1) .SD, y], names(df), init = df)
# ID SET
# 1: 1287030 100301013
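An alternative within data.table is to add the group sizes as columns and filter once; a sketch that gives the same result on this data (note the chained version above recounts after the first filter, which can matter with heavier duplication):

```r
library(data.table)

dt <- data.table(SET = c(100301006, 100301006, 100301010, 100301013, 100301011),
                 ID  = c(1287025, 1287026, 1287027, 1287030, 1287027))

# Count how often each ID and each SET occurs across the whole table,
# then keep only rows where both counts are exactly 1
dt[, n_id := .N, by = ID][, n_set := .N, by = SET]
res <- dt[n_id == 1 & n_set == 1, .(SET, ID)]
```

For the example data this keeps only the SET 100301013 / ID 1287030 row.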
Upvotes: 2
Reputation: 887108
We can use duplicated on each column independently to create a list of logical vectors, Reduce it to a single vector with &, and use that to subset the rows of the dataset
df1[Reduce(`&`, lapply(df1, function(x)
!(duplicated(x)|duplicated(x, fromLast = TRUE)))),]
# SET ID
#4 100301013 1287030
Or as @chinsoon12 suggested
m1 <- sapply(df1, function(x) !(duplicated(x)| duplicated(x, fromLast = TRUE)))
df1[rowSums(m1) == ncol(m1),, drop = FALSE]
df1 <- structure(list(SET = c(100301006L, 100301006L, 100301010L, 100301013L,
100301011L), ID = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L
)), class = "data.frame", row.names = c(NA, -5L))
Upvotes: 4
Reputation: 101335
With base R, you can use ave():
r <- df[which(with(df, ave(seq(nrow(df)), SET, FUN = length) * ave(seq(nrow(df)), ID, FUN = length)) == 1), ]
> r
SET ID
4 100301013 1287030
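The same logic reads a bit more clearly split into two steps; this is just a restructured sketch of the one-liner above:

```r
df <- read.table(text = "SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027", header = TRUE)

# ave() returns, for each row, the size of the group that row's value belongs to
n_set <- ave(seq_len(nrow(df)), df$SET, FUN = length)
n_id  <- ave(seq_len(nrow(df)), df$ID,  FUN = length)

# Keep rows whose SET and ID values each occur exactly once
res <- df[n_set == 1 & n_id == 1, ]
```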
DATA
df <- read.table(text="SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027",header = T)
Upvotes: 1
Reputation: 160437
Here's a quick base-R hack:
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027")
counts <- sapply(df, function(x) { tb <- table(x); tb[ match(x, names(tb)) ]; })
counts
# SET ID
# 100301006 2 1
# 100301006 2 1
# 100301010 1 2
# 100301013 1 1
# 100301011 1 2
At this point, we have the number of times each element is found in its column ... so we want rows where all counts are 1.
df[ rowSums(counts == 1) == ncol(df), ]
# SET ID
# 4 100301013 1287030
Upvotes: 3