Reputation: 1673
There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have a data frame with 10 rows and 50 columns, where some of the rows are exactly identical. If I use unique on it, I get one row per, let's say, "type", but what I actually want is to keep only those rows that appear exactly once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.
Upvotes: 43
Views: 27063
Reputation: 19088
An approach using vctrs::vec_duplicate_detect
Original example
library(vctrs)
vec <- c(1, 2, 2, 3, 4, 3, 2)
vec[!vec_duplicate_detect(vec)]
[1] 1 4
On a data.frame
df
  a b d
1 1 1 1
2 1 1 1
3 2 2 2
4 3 3 4
df[!vec_duplicate_detect(df),]
  a b d
3 2 2 2
4 3 3 4
Benchmark on a larger vector
length(vec)
[1] 175120
library(microbenchmark)
microbenchmark(
base = {vec[!(duplicated(vec) | duplicated(vec, fromLast = TRUE))]},
vctrs = {vec[!vec_duplicate_detect(vec)]})
Unit: milliseconds
  expr       min        lq     mean   median       uq      max neval
  base 12.241369 14.408094 16.70000 16.94082 17.26830 26.69546   100
 vctrs  7.526593  9.701161 11.43675 10.80420 11.64395 19.80494   100
Upvotes: 1
Reputation: 39858
A possibility involving dplyr could be:
df %>%
group_by_all() %>%
filter(n() == 1)
Or:
df %>%
group_by_all() %>%
filter(!any(row_number() > 1))
Since dplyr 1.0.0, the preferable way would be:
df %>%
group_by(across(everything())) %>%
filter(n() == 1)
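For anyone who wants to try this end to end, here is a minimal sketch. The data frame is a made-up example (rows 1 and 2 are identical), and the trailing ungroup() is my own addition so the result is not left grouped by every column.

```r
library(dplyr)

# Hypothetical example data: rows 1 and 2 are identical
df <- data.frame(a = c(1, 1, 2, 3),
                 b = c(1, 1, 2, 3),
                 d = c(1, 1, 2, 4))

singletons <- df %>%
  group_by(across(everything())) %>%  # one group per distinct row
  filter(n() == 1) %>%                # keep groups containing a single row
  ungroup()                           # drop the grouping again

print(singletons)
```

Note that the result is a tibble, so the original row names are not kept.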
Upvotes: 13
Reputation: 81693
This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: The function duplicated tests whether a row has already appeared earlier, scanning from the first row, so it flags the second and later occurrences. If the argument fromLast = TRUE is used, the function scans from the last row instead, so it flags every occurrence except the last one.
Both boolean results are combined with | (logical 'or') into a new vector that marks every row appearing more than once. The result is negated with !, giving a boolean vector that marks the rows appearing only once.
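To make the steps visible, here is the same one-liner split into its intermediate vectors. This is only a sketch: the data frame is a made-up example (rows 1 and 2 are identical) and the variable names are just for illustration.

```r
# Hypothetical example data: rows 1 and 2 are identical
df <- data.frame(a = c(1, 1, 2, 3),
                 b = c(1, 1, 2, 3),
                 d = c(1, 1, 2, 4))

# Flags a row if an identical row occurs earlier in the data frame
dup_forward <- duplicated(df)

# Flags a row if an identical row occurs later in the data frame
dup_backward <- duplicated(df, fromLast = TRUE)

# TRUE for every row that has a twin somewhere
appears_more_than_once <- dup_forward | dup_backward

# Negate to keep only the rows without a twin
singletons <- df[!appears_more_than_once, ]
print(singletons)
```

With this data, dup_forward flags only row 2 and dup_backward flags only row 1, so the combined vector removes both of them and rows 3 and 4 remain.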
Upvotes: 90