Reputation: 923
I have a data frame in R which looks like:
| RIC | Date | Open |
|--------|---------------------|--------|
| S1A.PA | 2011-06-30 20:00:00 | 23.7 |
| ABC.PA | 2011-07-03 20:00:00 | 24.31 |
| EFG.PA | 2011-07-04 20:00:00 | 24.495 |
| S1A.PA | 2011-07-05 20:00:00 | 24.23 |
I want to know if there are any duplicates with respect to the combination of RIC and Date. Is there a function for that in R?
Upvotes: 61
Views: 140779
Reputation: 4910
Building on Joran's answer, here's a dummy df:
df <- data.frame(
  'let' = c('a', 'a', 'b', 'b', 'c', 'c'),
  'num' = c(1, 1, 2, 3, 4, 4),
  'ind' = 1:6
)
I have never been super satisfied with base R's way of handling duplicates. As you can see, rows 1, 2, 5, and 6 are duplicates. Joran's answer returns rows 2 and 6, the rows that repeat an earlier one:
> df[duplicated(df[, 1:2]), ]
  let num ind
2   a   1   2
6   c   4   6
You might want to select all rows that carry a duplicated key, not just the later occurrences. At this point, it's easier to write a little wrapper. For vectors it's easy:
dupvals <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
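For example, on the dummy df above, it flags both members of each duplicated pair:
> dupvals(df$num)
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE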
For some incomprehensible reason, R's method for `rev` on a data.frame reverses the columns, not the rows (a data.frame is stored as a list of columns, and `rev` reverses that list). You can't reasonably overload this, either, because it's used in some modeling applications. There are many long-winded ways to program around this. My default approach is to define a value called "key" by pasting values across columns.
df$key <- apply(df[, 1:2], 1, paste, collapse=' ')
df[dupvals(df$key), ]
Gives:
> df[dupvals(df$key), ]
  let num ind key
1   a   1   1 a 1
2   a   1   2 a 1
5   c   4   5 c 4
6   c   4   6 c 4
Another way:
df[df$key %in% names(which(table(df$key)>1)), ]
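Broken down into steps, with intermediate names (`counts`, `dupkeys`) added purely for illustration:
counts  <- table(df$key)              # how many times each key occurs
dupkeys <- names(which(counts > 1))   # keys occurring more than once
df[df$key %in% dupkeys, ]             # all rows carrying a duplicated key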
Upvotes: 0
Reputation: 219
An expression like
df[df[, c('key1', 'key2')] |> duplicated() |> which(), ]
only shows the surplus part of each duplicated group (the second and later occurrences), not the first ones.
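A tiny demonstration of that limitation, on a hypothetical two-key frame:
d <- data.frame(key1 = c('a', 'a', 'b'), key2 = c(1, 1, 2))
d[d[, c('key1', 'key2')] |> duplicated() |> which(), ]  # returns only row 2, not row 1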
You can use the following to filter rows whose key(s) appear more than once:
library(magrittr)

#' @name check_duprows
#' @description
#'
#' check duplicated rows by key(s) in df
#'
#' @example
#' `df %>% check_duprows(key1, key2, ...)`
#'
#' @references
#' - main: [ans-62616469](https://stackoverflow.com/questions/6986657/find-duplicated-rows-based-on-2-columns-in-data-frame-in-r/62616469#62616469)
#' - select except: [ans-49515461](https://stackoverflow.com/questions/49515311/dplyr-select-all-variables-except-for-those-contained-in-vector/49515461#49515461)
#' - sort/order/arrange: [ans-6871968](https://stackoverflow.com/questions/1296646/sort-order-data-frame-rows-by-multiple-columns/6871968#6871968)
#'
check_duprows <- function(df, ..., .show_all = FALSE) df %>%
  dplyr::group_by(...) %>%
  dplyr::mutate(
    .dup_count  = dplyr::n(),               # rows sharing this key combination
    .dup_rownum = dplyr::row_number()) %>%  # position within the key group
  dplyr::ungroup() %>%
  dplyr::mutate(
    .is_duplicated  = .dup_rownum > 1,      # TRUE for the 2nd+ occurrence
    .has_duplicated = .dup_count > 1) %>%   # TRUE for every row of a duplicated key
  (\(tb) if (.show_all) tb else tb %>%
    dplyr::filter(.has_duplicated) %>%
    dplyr::select(-tidyselect::one_of('.has_duplicated'))) %>%
  dplyr::arrange(...)
Then just use it like:
df %>% check_duprows (key1, key2, ...)
Such as:
df <- data.frame(
  RIC = c(
    'S1A.PA', 'ABC.PA', 'EFG.PA',
    'S1A.PA', 'ABC.PA', 'EFG.PA'),
  Date = c(
    '2011-06-30 20:00:00',
    '2011-07-03 20:00:00',
    '2011-07-04 20:00:00',
    '2011-07-05 20:00:00',
    '2011-07-03 20:00:00',
    '2011-07-04 20:00:00'),
  Open = runif(n = 6, min = 20, max = 30)
)
df %>% check_duprows(RIC, Date)
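With this data, the duplicated key combinations are ('ABC.PA', '2011-07-03 20:00:00') and ('EFG.PA', '2011-07-04 20:00:00'), so the call returns those four rows together with the `.dup_count`, `.dup_rownum`, and `.is_duplicated` helper columns (the `Open` values vary from run to run because of `runif`).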
And you can also define a "uniquer" on top of this function:
unique_duprows <- function(df, ...) df %>%
  check_duprows(..., .show_all = TRUE) %>%
  dplyr::filter(!.is_duplicated) %>%   # keep only the first occurrence of each key
  dplyr::select(-tidyselect::one_of(
    '.has_duplicated',
    '.is_duplicated',
    '.dup_count',
    '.dup_rownum'))
df %>% dplyr::arrange(Open) %>% unique_duprows(RIC, Date)
It works just like a `distinct()` function!
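For comparison, a roughly equivalent call using `dplyr::distinct()` (covered in another answer below), keeping the first row per key after sorting:
df %>% dplyr::arrange(Open) %>% dplyr::distinct(RIC, Date, .keep_all = TRUE)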
Upvotes: 0
Reputation: 1904
An easy way to get the information you want is to use `dplyr`.
library(dplyr)
yourDF %>%
  group_by(RIC, Date) %>%
  mutate(num_dups = n(),
         dup_id = row_number()) %>%
  ungroup() %>%
  mutate(is_duplicated = dup_id > 1)
# A tibble: 6 × 6
  RIC    Date                 open num_dups dup_id is_duplicated
  <chr>  <chr>               <dbl>    <int>  <int> <lgl>
1 S1A.PA 2011-06-30 20:00:00  23.7        1      1 FALSE
2 ABC.PA 2011-07-03 20:00:00  24.3        2      1 FALSE
3 EFG.PA 2011-07-04 20:00:00  24.5        2      1 FALSE
4 S1A.PA 2011-07-05 20:00:00  24.2        1      1 FALSE
5 ABC.PA 2011-07-03 20:00:00  24.3        2      2 TRUE
6 EFG.PA 2011-07-04 20:00:00  24.5        2      2 TRUE
Using this:
- `num_dups` tells you how many times that particular combo is duplicated
- `dup_id` tells you which duplicate number that particular row is (e.g. 1st, 2nd, or 3rd, etc.)
- `is_duplicated` gives you an easy condition you can filter on later to remove all the duplicate rows (e.g. `filter(!is_duplicated)`), though you could also use `dup_id` for this (e.g. `filter(dup_id == 1)`)
Upvotes: 11
Reputation: 41
Found quite a masterful idea posted by Steve Lianouglou that helps solve this problem with the great advantage of indexing the repetitions:
If you generate a `hash` column concatenating both of the columns you want to check for duplicates, you can then use `dplyr::n()` together with `seq` to give an index to each duplicate occurrence, as follows:
library(dplyr)
library(stringr)

dat %>%
  mutate(hash = str_c(RIC, Date)) %>%
  group_by(hash) %>%
  mutate(duplication_id = seq(n())) %>%  # 1 for the first occurrence, 2 for the second, ...
  ungroup()
The `duplication_id` column indexes each occurrence: a value of n means the row is the nth appearance of that combination of values, i.e. n - 1 identical rows sit above it in the table. I used this to remove duplicate IDs.
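A minimal sketch of that removal step, assuming the indexed result above is stored in a (hypothetical) variable named `indexed`:
indexed %>% filter(duplication_id == 1)  # keep only the first occurrence of each combination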
Upvotes: 1
Reputation: 10422
Here's a `dplyr` option for tagging duplicates based on two (or more) columns. In this case `ric` and `date`:
library(dplyr)

df <- tibble(ric = c('S1A.PA', 'ABC.PA', 'EFG.PA', 'S1A.PA', 'ABC.PA', 'EFG.PA'),
             date = c('2011-06-30 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00', '2011-07-05 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00'),
             open = c(23.7, 24.31, 24.495, 24.23, 24.31, 24.495))
df %>%
  group_by(ric, date) %>%
  mutate(dupe = n() > 1)
# A tibble: 6 x 4
# Groups:   ric, date [4]
  ric    date                 open dupe
  <chr>  <chr>               <dbl> <lgl>
1 S1A.PA 2011-06-30 20:00:00  23.7 FALSE
2 ABC.PA 2011-07-03 20:00:00  24.3 TRUE
3 EFG.PA 2011-07-04 20:00:00  24.5 TRUE
4 S1A.PA 2011-07-05 20:00:00  24.2 FALSE
5 ABC.PA 2011-07-03 20:00:00  24.3 TRUE
6 EFG.PA 2011-07-04 20:00:00  24.5 TRUE
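From there, dropping every row involved in a duplication takes one more step (a small sketch reusing the `dupe` flag defined above):
df %>%
  group_by(ric, date) %>%
  mutate(dupe = n() > 1) %>%
  ungroup() %>%
  filter(!dupe)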
Upvotes: 22
Reputation: 1702
If you want to remove duplicate records based on the values of the columns `Date` and `State` in the data frame `dataset`:
# Indexes of the duplicate rows that will be removed:
duplicate_indexes <- which(duplicated(dataset[c('Date', 'State')]))
duplicate_indexes

# new_uniq will contain the unique dataset without the duplicates.
new_uniq <- dataset[!duplicated(dataset[c('Date', 'State')]), ]
View(new_uniq)
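Note that `duplicated()` keeps the first occurrence of each Date/State combination; if you would rather keep the last one, it takes a `fromLast` argument:
new_uniq_last <- dataset[!duplicated(dataset[c('Date', 'State')], fromLast = TRUE), ]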
Upvotes: 5
Reputation: 474
dplyr is so much nicer for this sort of thing:
library(dplyr)
yourDataFrame %>%
distinct(RIC, Date, .keep_all = TRUE)
(the ".keep_all is optional. if not used, it will return only the deduped 2 columns. when used, it returns the deduped whole data frame)
Upvotes: 32
Reputation: 21
I think what you're looking for is a way to return a data frame of the duplicated rows in the same format as your original data. There is probably a more elegant way to do this, but this works:
dup <- data.frame(as.numeric(duplicated(df$var))) # creates df with binary var for duplicated rows
colnames(dup) <- c("dup")                         # renames the column for simplicity
df2 <- cbind(df, dup)                             # binds to the original df
df3 <- subset(df2, dup == 1)                      # subsets df using the binary var for duplicates
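The same rows in a single step, without the helper column (for the two-column case from the question, replace `df$var` with `df[, c('RIC', 'Date')]`):
df3 <- df[duplicated(df$var), ]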
Upvotes: 2
Reputation: 173517
You can always try simply passing those first two columns to the function `duplicated`:
duplicated(dat[,1:2])
assuming your data frame is called `dat`. For more information, we can consult the help files for the `duplicated` function by typing `?duplicated` at the console. This provides the following description:
Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
So `duplicated` returns a logical vector, which we can then use to extract a subset of `dat`:
ind <- duplicated(dat[,1:2])
dat[ind,]
or you can skip the separate assignment step and simply use:
dat[duplicated(dat[,1:2]),]
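And if you want every row involved in a duplication, not just the later occurrences, you can combine `duplicated` with its `fromLast` argument (the same idea as the `dupvals` wrapper in another answer above):
dat[duplicated(dat[, 1:2]) | duplicated(dat[, 1:2], fromLast = TRUE), ]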
Upvotes: 86