Reputation: 141

Getting the maximum common words in R

I have data of the form:

ID       A1    A2    A3   ...   A100
1      john    max  karl  ...   kevin
2      kevin   bosy lary  ...   rosy
3      karl   lary  bosy  ...   hale
.
.
.
10000  isha   john  lewis ...   dave

I want to get one ID for each ID such that both of them have maximum number of common attributes(A1,A2,..A100)

How can I do this in R ? Edit: Let's call the output a MatchId:

ID      MatchId
1        70
2        4000
.
.
10000   3000

Upvotes: 0

Answers (3)

David Arenburg

Reputation: 92300

Using similar data as provided by @hrbrmstr

set.seed(1492)
dat <- do.call(rbind, lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
}))

You could achieve the same using base R only

Res <- sapply(seq_len(nrow(dat)), 
              function(x) apply(dat[-1], 1, 
              function(y) length(intersect(dat[x, -1], y))))
diag(Res) <- -1
cbind(dat[1], MatchId = max.col(Res, ties.method = "first"))
#    ID MatchId
# 1   1       5
# 2   2       7
# 3   3       5
# 4   4      12
# 5   5       1
# 6   6       9
# 7   7       8
# 8   8       7
# 9   9      10
# 10 10       9
# 11 11       9
# 12 12      13
# 13 13      12
# 14 14       8
# 15 15       2

Upvotes: 2

hrbrmstr

Reputation: 78842

I think this gets what you're looking for:

library(dplyr)

# make up some data

set.seed(1492)
rbind_all(lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
})) -> dat

print(dat)

## Source: local data frame [15 x 11]
## 
##    ID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1   1  H  F  E  C  B  A  R  J  Z   N
## 2   2  Q  P  E  M  L  Z  C  G  V   Y
## 3   3  Q  J  D  N  B  T  L  K  G   Z
## 4   4  D  Y  U  F  V  O  I  C  A   W
## 5   5  T  Z  D  I  J  F  R  C  B   S
## 6   6  Q  D  H  U  P  V  O  E  R   N
## 7   7  C  L  I  M  E  K  N  S  X   Z
## 8   8  M  J  S  E  N  O  F  Y  X   I
## 9   9  R  H  V  N  M  T  Q  X  L   S
## 10 10  Q  H  L  Y  B  W  S  M  P   X
## 11 11  M  N  J  K  B  G  S  X  V   R
## 12 12  W  X  A  H  Y  D  N  T  Q   I
## 13 13  K  H  V  J  D  X  Q  W  A   U
## 14 14  M  U  F  H  S  T  W  Z  O   N
## 15 15  G  B  U  Y  E  L  A  Q  W   O

# get commons

rbind_all(lapply(1:15, function(i) {
  rbind_all(lapply(setdiff(1:15, i), function(j) {
    data.frame(id1=i,
               id2=j,
               common=length(intersect(c(t(dat[i, 2:11])),
                                       c(t(dat[j, 2:11])))))
  }))
})) -> commons

commons %>%
  group_by(id1) %>%
  top_n(1, common) %>%
  filter(row_number()==1) %>%
  select(ID=id1, MatchId=id2)

## Source: local data frame [15 x 2]
## Groups: ID
## 
##    ID MatchId
## 1   1       5
## 2   2       7
## 3   3       5
## 4   4      12
## 5   5       1
## 6   6       9
## 7   7       8
## 8   8       7
## 9   9      10
## 10 10       9
## 11 11       9
## 12 12      13
## 13 13      12
## 14 14       8
## 15 15       2

Upvotes: 2

Jaehyeon Kim

Reputation: 1417

If I understand correctly, the requirement is to obtain the maximum number of common attributes for each ID.

Frequency tables can be obtained using table() and recursively in lapply(), assuming that ID column is unique - slight modification is necessary if not (unique(df$ID) rather than df$ID in lapply()). The maximum frequencies can be taken and, if there is a tie, only the first one is chosen. Finally they are combined by do.call().

df <- read.table(header = T, text = "
ID       A1    A2    A3   A100
1      john    max  karl  kevin
2      kevin   bosy lary  rosy
3      karl   lary  bosy  hale
10000  isha   john  lewis dave")

do.call(rbind, lapply(df$ID, function(x) {
  tbl <- table(unlist(df[df$ID == x, 2:ncol(df)]))
  data.frame(ID = x, MatchId = tbl[tbl == max(tbl)][1])
}))

#         ID MatchId
#john      1       1
#kevin     2       1
#karl      3       1
#isha  10000       1

Upvotes: 0

Getting the maximum common words in R

Answers (3)

Related Questions