Reputation: 218
I have a genetic dataset of IDs (dataset1) and a dataset of pairs of IDs that interact with each other (dataset2). I am trying to count how often each ID in dataset1 appears in either of the two interaction columns of dataset2, and also to record the interacting (matching) IDs in a third column.
Dataset1:
ID
1
2
3
Dataset2:
Interactor1 Interactor2
1 5
2 3
1 10
Output:
ID InteractionCount Interactors
1 2 5, 10
2 1 3
3 1 2
So the output contains every ID in dataset1, a count of how many times that ID appears in either column 1 or column 2 of dataset2, and, where it does appear, the IDs in dataset2 that it interacts with.
I have a biology background, so I have been guessing at how to approach this. So far I've used merge() and setDT(mergeddata)[, .N, by=ID] to try to count the dataset1 IDs that appear in dataset2, but I'm not sure this is the right approach if I also want to create the column storing the interacting IDs. Any help with functions that can store the matched IDs in a third column would be appreciated.
Input data:
dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L,
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
Upvotes: 1
Views: 88
Reputation: 1297
Another data.table answer.
library(data.table)
d1 <- data.table(ID=1:3)
d2 <- data.table(I1=c(1,2,1),I2=c(5,3,10))
# first stack I1 on I2 and vice versa
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
# ID x
# 1: 1 5
# 2: 2 3
# 3: 1 10
# 4: 5 1
# 5: 3 2
# 6: 10 1
# then keep the IDs found in d1 and collect the desired columns
# (d1$ID is used because d1[(ID)] would treat numeric IDs as row numbers)
Output <- Output[ID %in% d1$ID][
,.(InteractionCount=.N,
Interactors = list(x)),
by=ID]
Output
# ID InteractionCount Interactors
# 1: 1 2 5,10
# 2: 2 1 3
# 3: 3 1 2
EDIT: If the IDs are not numeric, you can set a key on d1:
library(data.table)
d1 <- data.table(ID=c("1","2","3A"))
setkey(d1,ID)
d2 <- data.table(I1=c("1","2","1"),I2=c("5","3A","10"))
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
# ID x
# 1: 1 5
# 2: 2 3A
# 3: 1 10
# 4: 5 1
# 5: 3A 2
# 6: 10 1
Output <- Output[ID %in% unlist(d1[(ID)])][
,.(InteractionCount=.N,
Interactors = list(x)),
by=ID]
Output
# ID InteractionCount Interactors
# 1: 1 2 5,10
# 2: 2 1 3A
# 3: 3A 1 2
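The question asks for every ID in dataset1 and shows the interactors as a comma-separated string. The following is only a sketch of one way to get that from the stacking step above. The extra ID 4 (with no interactions) and the names pairs, summ and res are made up for illustration.

```r
library(data.table)

# Hypothetical example data: ID 4 has no interactions at all
d1 <- data.table(ID = 1:4)
d2 <- data.table(I1 = c(1L, 2L, 1L), I2 = c(5L, 3L, 10L))

# Stack each pair in both directions, as in the answer above
pairs <- d2[, .(ID = c(I1, I2), x = c(I2, I1))]

# One row per ID: the count, and the partners collapsed into one string
summ <- pairs[, .(InteractionCount = .N, Interactors = toString(x)), by = ID]

# Join onto d1 so that every ID in d1 is kept, in d1's order
res <- summ[d1, on = "ID"]

# IDs without any match come back as NA; report their count as 0
res[is.na(InteractionCount), InteractionCount := 0L]
res
```

Whether an ID without interactions should get a count of 0 or be dropped is a judgment call; the filter with %in% drops it, this join keeps it.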
Upvotes: 2
Reputation: 25225
Here is an option using data.table:
x <- names(DT2)
cols <- c("InteractionCount", "Interactors")
#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)
#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))
#update DT1 (the question's dataset1, defined in the data block below) using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]
Output (the question's dataset1 with the two new columns added):
ID InteractionCount Interactors
1: 1 2 5, 10
2: 2 1 3
3: 3 1 2
data:
library(data.table)
DT1 <- structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
DT2 <- structure(list(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
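To illustrate what the pmin/pmax step is for, here is a small sketch on made-up pairs (the table dt and its contents are hypothetical, not from the question). Ordering each pair means that a reversed pair such as (3, 2) becomes identical to (2, 3), so unique() can remove it.

```r
library(data.table)

# Hypothetical pairs: (3, 2) is the reverse of (2, 3), and (1, 5) is repeated
dt <- data.table(i1 = c(1L, 2L, 3L, 1L), i2 = c(5L, 3L, 2L, 5L))

# Put the smaller ID first in every row
dt[, c("i1", "i2") := .(pmin(i1, i2), pmax(i1, i2))]

# Reversed and repeated pairs are now exact duplicates and can be dropped
dedup <- unique(dt)
dedup
```

Without this step a pair listed in both directions would presumably be counted twice for each of its IDs.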
Upvotes: 2
Reputation: 12471
Here's a solution based on the tidyverse package.
library(tidyverse)
d1 <- tibble(ID=1:3)
d2 <- tibble(Interactor1=c(1, 2, 1), Interactor2=c(5, 3, 10))
I think some of your difficulty is caused by the fact that your data is not tidy. You can read about what this means on the tidyverse homepage. Let's make d2 tidy:
d2narrow <- d2 %>% gather(key="Where", value="ID", Interactor1, Interactor2)
d2narrow
which gives:
# A tibble: 6 x 2
Where ID
<chr> <dbl>
1 Interactor1 1
2 Interactor1 2
3 Interactor1 1
4 Interactor2 5
5 Interactor2 3
6 Interactor2 10
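As a side note, gather() is marked as superseded in current versions of tidyr. A rough equivalent with pivot_longer() might look like the sketch below (d2narrow_alt is a made-up name). It returns the same six values, but ordered row by row rather than column by column.

```r
library(tidyr)
library(tibble)

d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# Same reshaping as gather(), written with pivot_longer()
d2narrow_alt <- pivot_longer(
  d2,
  cols = c(Interactor1, Interactor2),
  names_to = "Where",
  values_to = "ID"
)
d2narrow_alt
```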
Now getting the InteractionCounts is easy:
counts <- d2narrow %>% group_by(ID) %>% summarise(InteractionCount=n())
counts
# A tibble: 5 x 2
ID InteractionCount
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 5 1
5 10 1
We can get a list of Interactor2s for each value of Interactor1 by going back to the original d2...
interactors1 <- d2 %>%
group_by(Interactor1) %>%
summarise(With1=list(unique(Interactor2))) %>%
rename(ID=Interactor1)
interactors1
# A tibble: 2 x 2
ID With1
<dbl> <list>
1 1 <dbl [2]>
2 2 <dbl [1]>
If an ID can appear in both Interactor1 and Interactor2, things get a little more fiddly. (That doesn't happen in your example, but just in case...)
interactors2 <- d2 %>% group_by(Interactor2) %>% summarise(With2=list(unique(Interactor1))) %>% rename(ID=Interactor2)
interactors <- interactors1 %>%
full_join(interactors2, by="ID") %>%
unnest(cols=c(With1, With2)) %>%
mutate(With=ifelse(is.na(With1), With2, With1)) %>%
select(-With1, -With2)
interactors <- interactors %>%
group_by(ID) %>%
summarise(Interactors=list(unique(With)))
Now you can bring everything together, and make sure you get the data only for the IDs you want:
interactors <- d1 %>% left_join(counts, by="ID") %>% left_join(interactors, by="ID")
interactors
# A tibble: 3 x 3
ID InteractionCount Interactors
<dbl> <int> <list>
1 1 2 <dbl [2]>
2 2 1 <dbl [1]>
3 3 1 <dbl [1]>
That's the data in the format you requested (one column with a list of interactors for each ID). Just to prove it:
interactors$Interactors[1]
[[1]]
[1] 5 10
But I think you might find it easier to do more with the answer if it's in tidy form:
interactors %>% unnest(cols=c(Interactors))
# A tibble: 4 x 3
ID InteractionCount Interactors
<dbl> <int> <dbl>
1 1 2 5
2 1 2 10
3 2 1 3
4 3 1 2
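If you would rather have a single string per ID, as in the output shown in the question, a more compact route is to stack every pair in both directions first and then summarise. This is only a sketch: the names both_ways, Partner and result are made up here, and it assumes integer IDs as in the question's dput output.

```r
library(dplyr)

d1 <- tibble(ID = 1:3)
d2 <- tibble(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L, 3L, 10L))

# Every pair listed in both directions, so each ID appears in the ID column
both_ways <- bind_rows(
  transmute(d2, ID = Interactor1, Partner = Interactor2),
  transmute(d2, ID = Interactor2, Partner = Interactor1)
)

# Keep all IDs from d1, then count and collapse their partners
result <- d1 %>%
  left_join(both_ways, by = "ID") %>%
  group_by(ID) %>%
  summarise(
    InteractionCount = sum(!is.na(Partner)),
    Interactors = toString(Partner[!is.na(Partner)])
  )
result
```

An ID from d1 with no partner at all would get a count of 0 and an empty string here, because the left join keeps it with a missing Partner.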
Upvotes: 1