Reputation: 141
I have data of the form:
ID A1 A2 A3 ... A100
1 john max karl ... kevin
2 kevin bosy lary ... rosy
3 karl lary bosy ... hale
.
.
.
10000 isha john lewis ... dave
I want to get one ID for each ID such that both of them have maximum number of common attributes(A1,A2,..A100)
How can I do this in R ? Edit: Let's call the output a MatchId:
ID MatchId
1 70
2 4000
.
.
10000 3000
Upvotes: 0
Views: 90
Reputation: 92300
Using similar data as provided by @hrbrmstr
set.seed(1492)
dat <- do.call(rbind, lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
}))
You could achieve the same using base R only
Res <- sapply(seq_len(nrow(dat)),
function(x) apply(dat[-1], 1,
function(y) length(intersect(dat[x, -1], y))))
diag(Res) <- -1
cbind(dat[1], MatchId = max.col(Res, ties.method = "first"))
# ID MatchId
# 1 1 5
# 2 2 7
# 3 3 5
# 4 4 12
# 5 5 1
# 6 6 9
# 7 7 8
# 8 8 7
# 9 9 10
# 10 10 9
# 11 11 9
# 12 12 13
# 13 13 12
# 14 14 8
# 15 15 2
Upvotes: 2
Reputation: 78842
I think this gets what you're looking for:
library(dplyr)
# make up some data
set.seed(1492)
rbind_all(lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
})) -> dat
print(dat)
## Source: local data frame [15 x 11]
##
## ID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1 1 H F E C B A R J Z N
## 2 2 Q P E M L Z C G V Y
## 3 3 Q J D N B T L K G Z
## 4 4 D Y U F V O I C A W
## 5 5 T Z D I J F R C B S
## 6 6 Q D H U P V O E R N
## 7 7 C L I M E K N S X Z
## 8 8 M J S E N O F Y X I
## 9 9 R H V N M T Q X L S
## 10 10 Q H L Y B W S M P X
## 11 11 M N J K B G S X V R
## 12 12 W X A H Y D N T Q I
## 13 13 K H V J D X Q W A U
## 14 14 M U F H S T W Z O N
## 15 15 G B U Y E L A Q W O
# get commons
rbind_all(lapply(1:15, function(i) {
rbind_all(lapply(setdiff(1:15, i), function(j) {
data.frame(id1=i,
id2=j,
common=length(intersect(c(t(dat[i, 2:11])),
c(t(dat[j, 2:11])))))
}))
})) -> commons
commons %>%
group_by(id1) %>%
top_n(1, common) %>%
filter(row_number()==1) %>%
select(ID=id1, MatchId=id2)
## Source: local data frame [15 x 2]
## Groups: ID
##
## ID MatchId
## 1 1 5
## 2 2 7
## 3 3 5
## 4 4 12
## 5 5 1
## 6 6 9
## 7 7 8
## 8 8 7
## 9 9 10
## 10 10 9
## 11 11 9
## 12 12 13
## 13 13 12
## 14 14 8
## 15 15 2
Upvotes: 2
Reputation: 1417
If I understand correctly, the requirement is to obtain the maximum number of common attributes for each ID.
Frequency tables can be obtained using table()
and recursively in lapply()
, assuming that ID column is unique - slight modification is necessary if not (unique(df$ID)
rather than df$ID
in lapply()
). The maximum frequencies can be taken and, if there is a tie, only the first one is chosen. Finally they are combined by do.call()
.
df <- read.table(header = T, text = "
ID A1 A2 A3 A100
1 john max karl kevin
2 kevin bosy lary rosy
3 karl lary bosy hale
10000 isha john lewis dave")
do.call(rbind, lapply(df$ID, function(x) {
tbl <- table(unlist(df[df$ID == x, 2:ncol(df)]))
data.frame(ID = x, MatchId = tbl[tbl == max(tbl)][1])
}))
# ID MatchId
#john 1 1
#kevin 2 1
#karl 3 1
#isha 10000 1
Upvotes: 0