Reputation: 1481
I do have a statistical routine that does not like row exact duplicates (without ID) as resulting into null distances.
So I first detect duplicates which I remove, apply my routines and merge back records left aside.
For simplicity, consider I use rownames
as ID/key.
I have found following way to achieve my result in base R:
data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3))
# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
dup1 <- duplicated(data)
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]]
# keep only those records that do have duplicates to preveng running folowing steps on all rows
datadup <- data[dupID,]
# "hash" row
rowhash <- apply(datadup, 1, paste, collapse="_")
idmaps <- split(rownames(datadup),rowhash)
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))
Which gives me what I want, ie deduplicated data (easy) and mapping table.
> (data <- data[!dup1,])
x y
1 1 1
4 2 4
6 3 3
> idmaptable
mappedid otherids
1_1.1 1 2
1_1.2 1 3
2_4 4 5
I wonder whether there is a simpler or more effective method (data.table
/ dplyr
accepted). Any alternative to propose?
Upvotes: 0
Views: 231
Reputation: 51592
Some improvements to your base R solution,
df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),]
do.call(rbind, lapply(split(rownames(df),
do.call(paste, c(df, sep = '_'))), function(i)
data.frame(mapped = i[1],
others = i[-1],
stringsAsFactors = FALSE)))
Which gives,
mapped others 1_1.1 1 2 1_1.2 1 3 2_4 4 5
And of course,
unique(data) x y 1 1 1 4 2 4 6 3 3
Upvotes: 1
Reputation: 66819
With data.table...
library(data.table)
setDT(data)
# tag groups of dupes
data[, g := .GRP, by=x:y]
# do whatever analysis
f = function(DT) Reduce(`+`, DT)
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][]
# "update join" the results back to the main table if needed
data[resDT, on=.(g), res := i.res ]
The OP skipped a central part of the example (usage of the deduped data), so I just made up f
.
Upvotes: 4
Reputation: 39154
A solution using tidyverse
. I usually don't store information in the row names, so I created ID
and ID2
to store information. But of course, you can change that based on your needs.
library(tidyverse)
idmaptable <- data %>%
rowid_to_column() %>%
group_by(x, y) %>%
filter(n() > 1) %>%
unite(ID, x, y) %>%
mutate(ID2 = 1:n()) %>%
group_by(ID) %>%
mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>%
spread(ID_type, rowid) %>%
fill(mappedid) %>%
drop_na(otherids) %>%
mutate(ID2 = 1:n())
idmaptable
# A tibble: 3 x 4
# Groups: ID [2]
ID ID2 mappedid otherids
<chr> <int> <int> <int>
1 1_1 1 1 2
2 1_1 2 1 3
3 2_4 1 4 5
Upvotes: 1