Reputation: 175
I would like to identify and mark duplicate rows based on 2 columns. I would like to make a unique identifier for each duplicate so I know not just that the row is a duplicate, but which row it is a duplicate with. I have a dataframe that looks like below with some duplicate item pairs (on fit and sit) and other pairs that are not duplicated. While the item pairs are duplicated, the information they contain is unique (e.g., one row will have a value in Value1 for 1 row, but not Value2 and Value 3, the second or 'duplicate' row will have numbers for Value2 and Value3 just not Value1)
value1 value2 value3 fit sit
[1,] "1" NA NA "it1" "it2"
[2,] NA "3" "2" "it2" "it1"
[3,] "2" "3" "4" "it3" "it4"
[4,] NA NA NA "it4" "it3"
[5,] "5" NA NA "it5" "it6"
[6,] NA NA "2" "it6" "it5"
[7,] NA "4" NA "it7" "it9"
code to generate example dataframe
value1<-c(1,NA,2,NA,5,NA,NA)
value2<-c(NA,3,3,NA,NA,NA, 4)
value3<-c(NA,2,4,NA,NA,2, NA)
fit<-c("it1","it2","it3","it4", "it5", "it6","it7")
sit<-c("it2","it1","it4","it3", "it6", "it5", "it9")
df.now<-cbind(value1,value2,value3, fit, sit)
what I want is to convert it to a dataframe that looks like this:
val1 val2 val3 it1 it2
[1,] "1" "3" "2" "it1" "it2"
[2,] "2" "3" "4" "it3" "it4"
[3,] "5" NA "2" "it5" "it6"
[4,] NA "4" NA "it7" "it9"
I was thinking of doing the following steps: 1. create new variables using fit and sit with the lowest item and highest items to identify duplicate pairs 2. identify duplicated item pairs 3. use ifelse to select and fill in unique information.
I know how to do steps 1 and 3, but am stuck on step 2. I think what I need to do is not just identify TRUE/FALSE duplicate, but perhaps have a column with a unique identifier for each item pair like this (there are 2 extra rows because of my step 1):
value1 value2 value3 fit sit lit hit dup
[1,] "1" NA NA "it1" "it2" "it1" "it2" 1
[2,] NA "3" "2" "it2" "it1" "it1" "it2" 1
[3,] "2" "3" "4" "it3" "it4" "it3" "it4" 2
[4,] NA NA NA "it4" "it3" "it3" "it4" 2
[5,] "5" NA NA "it5" "it6" "it5" "it6" 3
[6,] NA NA "2" "it6" "it5" "it5" "it6" 3
[7,] NA "4" NA "it7" "it9" "it7" "it9" NA
I am not sure how to do this.
What I am asking for is either help with step 2 or perhaps there is a better way to solve it than the steps I outlined.
Upvotes: 12
Views: 2023
Reputation: 25225
Another data.table
option:
library(data.table)
as.data.table(df.now)[, lapply(.SD, function(x) first(x[!is.na(x)])),
.(it1=pmin(fit, sit), it2=pmax(fit, sit)),
.SDcols=value1:value3]
output:
it1 it2 value1 value2 value3
1: it1 it2 1 3 2
2: it3 it4 2 3 4
3: it5 it6 5 <NA> 2
4: it7 it9 <NA> 4 <NA>
Upvotes: 2
Reputation: 6234
This can also be done using tidyr
's pivot_longer
with values_drop_na = TRUE
combined with pivot_wider
:
library(tidyverse)
mydf %>%
mutate(it1 = pmin(fit, sit), it2 = pmax(fit, sit)) %>%
pivot_longer(cols = starts_with("value"), values_drop_na = TRUE) %>%
pivot_wider(id_cols = c("it1", "it2"))
#> # A tibble: 4 x 5
#> it1 it2 value1 value2 value3
#> <chr> <chr> <int> <int> <int>
#> 1 it1 it2 1 3 2
#> 2 it3 it4 2 3 4
#> 3 it5 it6 5 NA 2
#> 4 it7 it9 NA 4 NA
Data
mydf <- structure(list(value1 = c(1L, NA, 2L, NA, 5L, NA, NA), value2 = c(NA,
3L, 3L, NA, NA, NA, 4L), value3 = c(NA, 2L, 4L, NA, NA, 2L, NA
), fit = c("it1", "it2", "it3", "it4", "it5", "it6", "it7"),
sit = c("it2", "it1", "it4", "it3", "it6", "it5", "it9")), class = "data.frame", row.names = c(NA,
-7L))
Upvotes: 1
Reputation: 886938
Using melt/dcast
from data.table
library(data.table)
dcast(melt(setDT(df.now)[, c('fit1', 'sit1') := .(pmin(fit, sit),
pmax(fit, sit))], measure = patterns("^value"), na.rm = TRUE),
fit1 + sit1 ~ variable, value.var = 'value')
# fit1 sit1 value1 value2 value3
#1: it1 it2 1 3 2
#2: it3 it4 2 3 4
#3: it5 it6 5 NA 2
#4: it7 it9 NA 4 NA
df.now <- data.frame(value1,value2,value3, fit, sit, stringsAsFactors = FALSE)
Upvotes: 2
Reputation: 23574
Here is my attempt using data.table. Your data is called mydf
. First, I sorted fit
and sit
for each row and created a new variable, group
. Then, for each group, I sorted values in the three value columns (i.e., value1, value2, and value3). Finally, I extracted first row for each group.
library(data.table)
mydt <- setDT(mydf)[, group := paste(sort(.SD), collapse = "_"),
.SD = c("fit", "sit"), by = 1:nrow(mydf)][,
c("value1", "value2", "value3") := lapply(.SD, sort),
.SDcols = value1:value3, by = group][, .SD[1], by = group]
mydt[]
# group value1 value2 value3 fit sit
#1: it1_it2 1 3 2 it1 it2
#2: it3_it4 2 3 4 it3 it4
#3: it5_it6 5 NA 2 it5 it6
#4: it7_it9 NA 4 NA it7 it9
DATA
mydf <- structure(list(value1 = c(1L, NA, 2L, NA, 5L, NA, NA), value2 = c(NA,
3L, 3L, NA, NA, NA, 4L), value3 = c(NA, 2L, 4L, NA, NA, 2L, NA
), fit = c("it1", "it2", "it3", "it4", "it5", "it6", "it7"),
sit = c("it2", "it1", "it4", "it3", "it6", "it5", "it9")), class = "data.frame", row.names = c(NA,
-7L))
Upvotes: 1
Reputation: 39858
One dplyr
option could be:
df.now %>%
group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)),
NA,
first(na.omit(.))))
pair value1 value2 value3
<chr> <dbl> <dbl> <dbl>
1 it2_it1 1 3 2
2 it4_it3 2 3 4
3 it6_it5 5 NA 2
4 it9_it7 NA 4 NA
And if you also need the pairs in individual columns, then with the addition of tidyr
you can do:
df.now %>%
group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)),
NA,
first(na.omit(.)))) %>%
separate(pair, into = c("fit", "hit"), sep = "_", remove = FALSE)
pair fit hit value1 value2 value3
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 it2_it1 it2 it1 1 3 2
2 it4_it3 it4 it3 2 3 4
3 it6_it5 it6 it5 5 NA 2
4 it9_it7 it9 it7 NA 4 NA
Upvotes: 6
Reputation: 72593
Use !duplicated()
after sort
ing.
df.now[!duplicated(t(apply(df.now[, c("fit", "sit")], 1, sort))), ]
# value1 value2 value3 fit sit
# [1,] "1" NA NA "it1" "it2"
# [2,] "2" "3" "4" "it3" "it4"
# [3,] "5" NA NA "it5" "it6"
# [4,] NA "4" NA "it7" "it9"
Upvotes: 3