Reputation: 27

Calculate number of rows that appear in the same group using R

I have a data frame that looks like this:

|A      |B      |
|-------|-------|
|1      |MPRIP  |
|1      |RAI14  |
|1      |TPM1   |
|MPRIP  |RAI14  |
|MPRIP  |CDK1   |
|2      |APOBEC1|
|2      |KHSRP  |
|2      |SYNCRIP|
|APOBEC1|SYNCRIP|

`dput()` output of a larger sample (100 rows; columns OFFICIAL_SYMBOL_A and OFFICIAL_SYMBOL_B):

structure(list(OFFICIAL_SYMBOL_A = c("1810055G02Rik", "1810055G02Rik", 
"1810055G02Rik", "2810046L04Rik", "2810046L04Rik", "4922501C03Rik", 
"4930572J05Rik", "4930572J05Rik", "4930572J05Rik", "4930572J05Rik", 
"4930572J05Rik", "4930572J05Rik", "9830001H06Rik", "9830001H06Rik", 
"9830001H06Rik", "9830001H06Rik", "9830001H06Rik", "9830001H06Rik", 
"9830001H06Rik", "9830001H06Rik", "9830001H06Rik", "A1CF", "A1CF", 
"A1CF", "A2M", "A2M", "A2M", "A2M", "A2M", "A2M", "A2M", "A2M", 
"AAGAB", "AATF", "AATF", "AATF", "AATF", "AATF", "AATF", "AATF", 
"AATF", "AATF", "AATF", "ABCA1", "ABCA1", "ABCA1", "ABCA1", "ABCA1", 
"ABCA1", "ABCA1", "ABCA1", "ABCA1", "ABCA1", "ABCA1", "ABCA1", 
"ABCA1", "ABCA1", "ABCA1", "ABCA13", "ABCA13", "ABCA2", "ABCA4", 
"ABCB1", "ABCB1", "ABCB1", "ABCB7", "ABCC2", "ABCC2", "ABCC2", 
"ABCC8", "ABCC8", "ABCD1", "ABCD3", "ABCD3", "ABCD4", "ABCD4", 
"ABCE1", "ABCE1", "ABCF3", "ABCF3", "ABCF3", "ABCF3", "ABCG1", 
"ABCG5", "ABHD16A", "ABHD16A", "ABHD16A", "ABHD16A", "ABHD16A", 
"ABHD16A", "ABHD16A", "ABHD16A", "ABHD16A", "ABHD16A", "ABI1", 
"ABI1", "ABI1", "ABI1", "ABI2", "ABI2"), OFFICIAL_SYMBOL_B = c("MPRIP", 
"RAI14", "TPM1", "ARF1", "ARF3", "CPNE4", "C8orf55", "PRKDC", 
"SPRR2B", "SPRR2D", "SPRR2E", "SPRR2G", "C1QBP", "CCDC165", "DYNLL1", 
"KIAA0889", "PPP2CA", "PPP2R1A", "PPP2R5A", "PPP2R5E", "RNF41", 
"APOBEC1", "KHSRP", "SYNCRIP", "ADAMTS1", "APOE", "IL10", "IL4", 
"LCAT", "LEP", "NGF", "PAEP", "EIF3C", "CHEK2", "MAGED1", "MAPT", 
"PAWR", "PCBD2", "RB1", "RBL1", "RBL2", "RELA", "Tsg101", "AOX1", 
"APOA1", "CDC42", "COPS5", "CREBBP", "FADD", "FLOT1", "HGS", 
"PRPF8", "SDHB", "SNTB2", "STX12", "UBC", "UGP1", "XPC", "APOA1", 
"APOA2", "CDK5RAP2", "CNGB1", "DHX9", "PIM1", "UBC", "FECH", 
"PDZD3", "Rdx", "SLC9A3R1", "KCNJ11", "RAPGEF4", "ABCD2", "ABCD1", 
"PEX19", "PEA15", "XRCC6", "RNASEL", "UBASH3A", "ACIN1", "DNALI1", 
"LAMTOR1", "TOE1", "UBC", "ABCG8", "ATP5G3", "DNAJC1", "GPRC5C", 
"HM13", "IFITM1", "RNF5", "SAFB", "SPAG7", "TMEM147", "TMEM222", 
"ABL1", "ENAH", "EPS8", "NCK1", "ABL1", "CCDC53")), row.names = c(NA, 
100L), class = "data.frame")

You can think of A as a node that is connected to B. Here, node 1 is connected to three different genes (listed in B), and among these genes, MPRIP is also connected to RAI14 (row 4). I want to count, for each group in column A, the number of links among the column B entries that belong to that group. For example, in this table, group 1 has three neighbors, and MPRIP links to RAI14 (both are in group 1), so the count is 1. Below is my expected output.

|A      |number of links among neighbors|
|-------|-------------------------------|
|1      |1                              |
|MPRIP  |0                              |
|2      |1                              |
|APOBEC1|0                              |

Thank you in advance!

Upvotes: 0

Answers (2)

Dave2e

Reputation: 24139

Based on your expanded sample, this may be what you are looking for. It groups by the first column (A) and then counts, for each group, how many of its column B elements also appear anywhere in column A.

library(dplyr)

df <- structure(list(
  A = c("1", "1", "1", "MPRIP", "MPRIP", "2", "2", "2", "APOBEC1"),
  B = c("MPRIP", "RAI14", "TPM1", "RAI14", "CDK1",
        "APOBEC1", "KHSRP", "SYNCRIP", "SYNCRIP")
), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))

# per group in A, count the B values that also occur somewhere in column A
df %>% group_by(A) %>% summarize(sum = sum(B %in% df$A))

# same idea with the column names from the dput output:
# df %>% group_by(OFFICIAL_SYMBOL_A) %>% summarize(matches = sum(OFFICIAL_SYMBOL_B %in% df$OFFICIAL_SYMBOL_A))

By the way... We Are!!!

Upvotes: 1

dash2

Reputation: 2262

Gotta get the data structures right:

library(dplyr)

# mydata is assumed to be the question's data frame with columns A and B

# self-join: match each B to the rows where it appears as A
links <- inner_join(mydata, mydata, by = c("B" = "A"))
# now B and B.y represent your links

# join back up to the original data:
links <- inner_join(links, mydata, by = c("B.y" = "B"))

# make sure both ends of the link are in the same "A" group
links <- filter(links, A.x == A.y)

links %>% group_by(A.x) %>% summarize(link_count = n())

This doesn't include any zeros. You could left join the result back on to mydata and fill in the zeros, if you want.
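
A minimal sketch of that left join, assuming `mydata` has columns `A` and `B` as in the question's small sample table. The names `link_counts` and `result` are only illustrative, and `coalesce()` is one of several ways to fill in the zeros:

```r
library(dplyr)

# Sample data from the question (assumed column names A and B)
mydata <- data.frame(
  A = c("1", "1", "1", "MPRIP", "MPRIP", "2", "2", "2", "APOBEC1"),
  B = c("MPRIP", "RAI14", "TPM1", "RAI14", "CDK1",
        "APOBEC1", "KHSRP", "SYNCRIP", "SYNCRIP"),
  stringsAsFactors = FALSE
)

# Same steps as above: find B -> B.y links, keep those inside one A group
links <- mydata %>%
  inner_join(mydata, by = c("B" = "A")) %>%
  inner_join(mydata, by = c("B.y" = "B")) %>%
  filter(A.x == A.y)

# Count links per group, renaming A.x back to A for the join below
link_counts <- links %>%
  group_by(A = A.x) %>%
  summarize(link_count = n())

# One row per group in order of first appearance; groups without links get 0
result <- mydata %>%
  distinct(A) %>%
  left_join(link_counts, by = "A") %>%
  mutate(link_count = coalesce(link_count, 0L))

print(result)
```

On larger data, recent dplyr versions may warn about many-to-many relationships in the joins; that is expected here, since a gene can appear in several rows.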

Upvotes: 1
