Reputation: 218

How to conditionally count and record if a sample appears in rows of another dataset?

I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count IDs in dataset1 which appear in either of 2 interaction columns in dataset2 and also record which are the interacting/matching IDs in a 3rd column.

Dataset1:

ID
1
2
3

Dataset2:

Interactor1    Interactor2
1                  5
2                  3
1                  10

Output:

ID   InteractionCount   Interactors
1    2                  5, 10
2    1                  3
3    1                  2

So the output contains all IDs from dataset1, a count of how many times each ID appears in either column of dataset2, and, where it does appear, the IDs in dataset2 that it interacts with.

I have a biology background, so I have been guessing at how to approach this. So far I have used merge() and setDT(mergeddata)[, .N, by=ID] to count the dataset1 IDs that appear in dataset2, but I'm not sure this is the right approach if I also want to create the column storing the interacting IDs. Any help with functions that can store the matched IDs in a 3rd column would be appreciated.

Input data:

dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L, 
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

Upvotes: 1

Answers (3)

DaveTurek

Reputation: 1297

Another data.table answer.

library(data.table)
d1 <- data.table(ID=1:3)
d2 <- data.table(I1=c(1,2,1),I2=c(5,3,10))

# first stack I1 on I2 and vice versa
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2  3
# 3:  1 10
# 4:  5  1
# 5:  3  2
# 6: 10  1

# then collect the desired columns
# keep only the IDs present in d1; d1$ID does not rely on the IDs being 1..n
Output <- Output[ID %in% d1$ID][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1           3
# 3:  3                1           2

EDIT: If the IDs are not numeric, you can set a key on d1:

library(data.table)
d1 <- data.table(ID=c("1","2","3A"))
setkey(d1,ID)
d2 <- data.table(I1=c("1","2","1"),I2=c("5","3A","10"))

Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2 3A
# 3:  1 10
# 4:  5  1
# 5: 3A  2
# 6: 10  1

Output <- Output[ID %in% unlist(d1[(ID)])][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1          3A
# 3: 3A                1           2
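
The filter above drops any ID in d1 that never appears in d2, whereas the question asks for all IDs of dataset1 in the output. A minimal sketch of one way to keep them, assuming a join with by=.EACHI is acceptable; the extra ID 4 and the name `stacked` are made up for illustration, and toString() is used to get a single comma-separated string instead of a list column:

```r
library(data.table)

# ID 4 is a hypothetical extra ID with no interactions
d1 <- data.table(ID = 1:4)
d2 <- data.table(I1 = c(1L, 2L, 1L), I2 = c(5L, 3L, 10L))

# stack the pairs in both directions, as in the answer above
stacked <- d2[, .(ID = c(I1, I2), x = c(I2, I1))]

# join onto d1 and summarise once per row of d1;
# unmatched IDs get x = NA, which is excluded from the count and the string
res <- stacked[d1, on = "ID",
               .(InteractionCount = sum(!is.na(x)),
                 Interactors = toString(x[!is.na(x)])),
               by = .EACHI]
res
```

With this data, ID 4 should come back with a count of 0 and an empty Interactors string rather than being dropped.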

Upvotes: 2

chinsoon12

Reputation: 25225

Here is an option using data.table:

x <- names(DT2)
cols <- c("InteractionCount", "Interactors")

#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)

#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
    DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))

#update dataset1 using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]

output for DT1:

   ID InteractionCount Interactors
1:  1                2       5, 10
2:  2                1           3
3:  3                1           2

data:

library(data.table)
DT1 <- structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
DT2 <- structure(list(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))

Upvotes: 2

Limey

Reputation: 12471

Here's a solution based on the tidyverse package.

library(tidyverse)

d1 <- tibble(ID=1:3)
d2 <- tibble(Interactor1=c(1, 2, 1), Interactor2=c(5, 3, 10))

I think some of your difficulty is caused by the fact that your data is not tidy. You can read about what this means on the tidyverse homepage. Let's make d2 tidy:

d2narrow <- d2 %>% gather(key="Where", value="ID", Interactor1, Interactor2)
d2narrow

which gives:

# A tibble: 6 x 2
  Where          ID
  <chr>       <dbl>
1 Interactor1     1
2 Interactor1     2
3 Interactor1     1
4 Interactor2     5
5 Interactor2     3
6 Interactor2    10
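
As a side note, gather() is marked as superseded in current tidyr. A rough equivalent with pivot_longer(), assuming tidyr 1.0 or later, might look like the sketch below. The rows come out in a different order (row by row rather than column by column), which does not affect the counts that follow:

```r
library(tibble)
library(tidyr)

d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# one row per (column name, ID) pair, same content as the gather() call above
d2narrow <- pivot_longer(d2,
                         cols = c(Interactor1, Interactor2),
                         names_to = "Where",
                         values_to = "ID")
d2narrow
```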

Now getting the InteractionCounts is easy:

counts <- d2narrow %>% group_by(ID) %>% summarise(InteractionCount=n())
counts

# A tibble: 5 x 2
     ID InteractionCount
  <dbl>            <int>
1     1                2
2     2                1
3     3                1
4     5                1
5    10                1

We can get a list of Interactor2s for each value of Interactor1 by going back to the original d2...

interactors1 <- d2 %>% 
                  group_by(Interactor1) %>% 
                  summarise(With1=list(unique(Interactor2))) %>% 
                  rename(ID=Interactor1)
interactors1

# A tibble: 2 x 2
     ID With1    
  <dbl> <list>   
1     1 <dbl [2]>
2     2 <dbl [1]>

If an ID can appear in both Interactor1 and Interactor2, things get a little more fiddly. (That doesn't happen in your example, but just in case...)

interactors2 <- d2 %>% group_by(Interactor2) %>% summarise(With2=list(unique(Interactor1))) %>% rename(ID=Interactor2)
interactors <- interactors1 %>% 
                 full_join(interactors2, by="ID") %>% 
                 unnest(cols=c(With1, With2)) %>% 
                 mutate(With=ifelse(is.na(With1), With2, With1)) %>% 
                 select(-With1, -With2)
interactors <- interactors %>% 
                 group_by(ID) %>% 
                 summarise(Interactors=list(unique(With)))

Now you can bring everything together, and make sure you get the data only for the IDs you want:

interactors <- d1 %>% left_join(counts, by="ID") %>% left_join(interactors, by="ID")
interactors

# A tibble: 3 x 3
     ID InteractionCount Interactors
  <dbl>            <int> <list>     
1     1                2 <dbl [2]>  
2     2                1 <dbl [1]>  
3     3                1 <dbl [1]>

That's the data in the format you requested (one column with a list of interactors for each ID). Just to prove it:

interactors$Interactors[1]

[[1]]
[1]  5 10

But I think you might find it easier to do more with the answer if it's in tidy form:

interactors %>% unnest(cols=c(Interactors))

# A tibble: 4 x 3
     ID InteractionCount Interactors
  <dbl>            <int>       <dbl>
1     1                2           5
2     1                2          10
3     2                1           3
4     3                1           2
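
For comparison, here is a more compact sketch of the same idea: stack the pairs in both directions first, then join and summarise once. This is an alternative under stated assumptions, not the method walked through above; the names `pairs`, `Partner` and `result` are made up for illustration, and Interactors is stored as a comma-separated string rather than a list column:

```r
library(dplyr)

d1 <- tibble(ID = c(1, 2, 3))
d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# every interaction listed twice, once from each partner's point of view
pairs <- bind_rows(
  transmute(d2, ID = Interactor1, Partner = Interactor2),
  transmute(d2, ID = Interactor2, Partner = Interactor1)
)

# left_join keeps every ID in d1; IDs without a match get Partner = NA
result <- d1 %>%
  left_join(pairs, by = "ID") %>%
  group_by(ID) %>%
  summarise(InteractionCount = sum(!is.na(Partner)),
            Interactors = toString(Partner[!is.na(Partner)]))
result
```

Because the pairs are stacked before summarising, an ID that appears in both interaction columns needs no special handling.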

Upvotes: 1
