Fazal123
Fazal123

Reputation: 11

Finding and retrieving rows for which values of two columns are different

I have a dataframe (~36K rows, 3 columns) part of which is shown below.

hgnc_symbol ensembl_gene_id external_synonym
1 A1CF ENSG00000148584 ACF
2 A1CF ENSG00000148584 ACF64
3 A1CF ENSG00000148584 ACF65
4 A1CF ENSG00000148584 APOBEC1CF
5 A1CF ENSG00000148584 ASP
6 A2M ENSG00000175899 CPAMD5
7 A2M ENSG00000175899 FWP007
8 A2M ENSG00000175899 S863-7
9 A2ML1 ENSG00000166535 CPAMD9
10 A2ML1 ENSG00000166535 FLJ25179
11 A2ML1 ENSG00000166535 p170
12 AADAC ENSG00000114771 CES5A1
13 AADAC ENSG00000114771 DAC
14 AADACL2 ENSG00000261846 MGC72001
15 AADACL2 ENSG00000197953 MGC72001

I would like to find rows for which the hgnc_symbol column values are the same but the ensembl_gene_id column value is different. I would like to retrieve such entries. The result in this case should look something like this:

hgnc_symbol ensembl_gene_id external_synonym
14 AADACL2 ENSG00000261846 MGC72001
15 AADACL2 ENSG00000197953 MGC72001

I would really appreciate any solutions to this.

Upvotes: 1

Views: 34

Answers (1)

benson23
benson23

Reputation: 19142

You can group_by your relevant columns, then filter for number of observations n() that equals 1.

library(dplyr)

df %>% group_by(hgnc_symbol, ensembl_gene_id) %>% filter(n() == 1)

# A tibble: 2 × 3
# Groups:   hgnc_symbol, ensembl_gene_id [2]
  hgnc_symbol ensembl_gene_id external_synonym
  <chr>       <chr>           <chr>           
1 AADACL2     ENSG00000261846 MGC72001        
2 AADACL2     ENSG00000197953 MGC72001        

Upvotes: 2

Related Questions