Reputation: 970
Using R. Base package, dplyr, or data.table are all okay for me to use. My data is ~1000 rows x 20 columns. I expect about 300 duplicates.
I'd like to do something like the following, but with one alteration:
Match/group duplicate rows (indices)
I'd like to find, not fully duplicated rows, but rows duplicated in two columns. For example, given this input table:
File T.N ID Col1 Col2
BAI.txt T 1 sdaf eiri
BAJ.txt N 2 fdd fds
BBK.txt T 1 ter ase
BCD.txt N 1 twe ase
If I want to find duplicates in T.N & ID only, I'd end up with the following table:
File T.N ID Col1 Col2
BAI.txt T 1 sdaf eiri
BBK.txt T 1 ter ase
Upvotes: 16
Views: 28184
Reputation: 1441
Just use the selected columns to create the index you want to subset on
data <- read.csv()
data <- data[which(duplicated(data[,c('T.N','ID')])==TRUE),]
Upvotes: 4
Reputation: 1960
A simple solution is find_duplicates
from hablar
library(dplyr)
library(data.table)
library(hablar)
df <- fread("
File T.N ID Col1 Col2
BAI.txt T 1 sdaf eiri
BAJ.txt N 2 fdd fds
BBK.txt T 1 ter ase
BCD.txt N 1 twe ase
")
df %>%
find_duplicates(T.N, ID)
which returns the rows with duplicates in T.N and ID:
File T.N ID Col1 Col2
<chr> <chr> <int> <chr> <chr>
1 BAI.txt T 1 sdaf eiri
2 BBK.txt T 1 ter ase
Upvotes: 6
Reputation: 119
I have found this to be an easy and useful method.
tr <- tribble(~File, ~TN, ~ID, ~Col1, ~Col2,
'BAI.txt', 'T', 1, 'sdaf', 'eiri',
'BAJ.txt', 'N', 2, 'fdd', 'fds',
'BBK.txt', 'T', 1, 'ter', 'ase',
'BCD.txt', 'N', 1, 'twe', 'ase')
group_by(tr, TN, ID) %>%
filter(n() > 1)
Output:
# A tibble: 2 x 5
# Groups: TN, ID [1]
File TN ID Col1 Col2
<chr> <chr> <dbl> <chr> <chr>
1 BAI.txt T 1 sdaf eiri
2 BBK.txt T 1 ter ase
Upvotes: 8
Reputation: 4472
Here is an option using duplicated
twice, second time along with fromLast = TRUE
option because it returns TRUE only from the duplicate value on-wards
dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]
# File T.N ID Col1 Col2
#1 BAI.txt T 1 sdaf eiri
#3 BBK.txt T 1 ter ase
Upvotes: 23