Reputation: 11
I have a dataframe (~36K rows, 3 columns) part of which is shown below.
hgnc_symbol | ensembl_gene_id | external_synonym | |
---|---|---|---|
1 | A1CF | ENSG00000148584 | ACF |
2 | A1CF | ENSG00000148584 | ACF64 |
3 | A1CF | ENSG00000148584 | ACF65 |
4 | A1CF | ENSG00000148584 | APOBEC1CF |
5 | A1CF | ENSG00000148584 | ASP |
6 | A2M | ENSG00000175899 | CPAMD5 |
7 | A2M | ENSG00000175899 | FWP007 |
8 | A2M | ENSG00000175899 | S863-7 |
9 | A2ML1 | ENSG00000166535 | CPAMD9 |
10 | A2ML1 | ENSG00000166535 | FLJ25179 |
11 | A2ML1 | ENSG00000166535 | p170 |
12 | AADAC | ENSG00000114771 | CES5A1 |
13 | AADAC | ENSG00000114771 | DAC |
14 | AADACL2 | ENSG00000261846 | MGC72001 |
15 | AADACL2 | ENSG00000197953 | MGC72001 |
I would like to find rows for which the hgnc_symbol column values are the same but the ensembl_gene_id column value is different. I would like to retrieve such entries. The result in this case should look something like this:
hgnc_symbol | ensembl_gene_id | external_synonym | |
---|---|---|---|
14 | AADACL2 | ENSG00000261846 | MGC72001 |
15 | AADACL2 | ENSG00000197953 | MGC72001 |
I would really appreciate any solutions to this.
Upvotes: 1
Views: 34
Reputation: 19142
You can group_by
your relevant columns, then filter
for number of observations n()
that equals 1.
library(dplyr)
df %>% group_by(hgnc_symbol, ensembl_gene_id) %>% filter(n() == 1)
# A tibble: 2 × 3
# Groups: hgnc_symbol, ensembl_gene_id [2]
hgnc_symbol ensembl_gene_id external_synonym
<chr> <chr> <chr>
1 AADACL2 ENSG00000261846 MGC72001
2 AADACL2 ENSG00000197953 MGC72001
Upvotes: 2