How to account for merge/join adding excess rows?

Question

I have a dataset of genes that I map to a type of protein ID, and I am trying to find those protein IDs in a second dataset. The second dataset is fairly large, at 11,759,454 rows. I look up the matching protein IDs with either join or merge, for example:

library(plyr)  # assumption: join() here is plyr::join, which does a left join by default
testdf <- join(proteindf, genes)  # or:
testdf <- merge(proteindf, genes, by = 'protein_id', all.x = TRUE)

Both run, but the row order changes and testdf grows to 11,775,850 rows, which is 16,396 more than the 11,759,454 rows in proteindf.
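To make the symptom concrete, here is a minimal sketch on made-up data (the IDs and gene names below are hypothetical, not taken from the real datasets). It assumes the extra rows come from protein IDs that appear more than once in the gene table: a left join then returns one output row per matching pair, so the result can be longer than the left-hand table. The change in row order is expected with merge, because its `sort` argument defaults to `TRUE`.

```r
# Toy stand-ins for the real data (hypothetical values)
proteindf <- data.frame(
  protein_id  = c("P1", "P1", "P2"),
  protein_id1 = c("P7", "P8", "P9"),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  protein_id = c("P1", "P1", "P2"),   # "P1" is listed twice on purpose
  Gene       = c("GENE_A", "GENE_A_ALT", "GENE_B"),
  stringsAsFactors = FALSE
)

# Left join: every proteindf row is paired with every matching genes row
testdf <- merge(proteindf, genes, by = "protein_id", all.x = TRUE)

n_before <- nrow(proteindf)  # 3 rows going in
n_after  <- nrow(testdf)     # 2 x 2 rows for "P1" plus 1 row for "P2"
print(c(before = n_before, after = n_after))
```

If the real gene table has no repeated protein IDs, this explanation does not apply and the cause lies elsewhere.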

I am not sure how to address this. I have a biology background and have also tried an SQL version of the merge, but it runs indefinitely without finishing.

I can't provide full data, but generally the datasets look like:

#gene dataset:

     protein_id             Gene
1   9606.ENSP00000378868    A1CF
2   9606.ENSP00000384794    A4GALT
3   9606.ENSP00000324842    AACS
4   9606.ENSP00000000233    ARF5

#proteindf:
       protein_id                 protein_id1    coexpression experiments database
1   9606.ENSP00000000233    9606.ENSP00000272298        0          0        0
2   9606.ENSP00000000233    9606.ENSP00000253401        0          0        0
3   9606.ENSP00000000233    9606.ENSP00000401445        0          0        0
4   9606.ENSP00000000233    9606.ENSP00000418915        0          0        0

A given protein_id value can be repeated across many rows, which I assume contributes to the problem.
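One way to test that assumption is to count repeated keys in the lookup table before merging. The sketch below uses a small made-up gene table (hypothetical values); with the real data, the same calls would be run on `genes$protein_id`. Repeats in proteindf alone do not add rows to a left join; repeats in the gene table do.

```r
# Toy gene table with one repeated key (hypothetical values)
genes <- data.frame(
  protein_id = c("P1", "P1", "P2", "P3"),
  Gene       = c("GENE_A", "GENE_A_ALT", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Number of rows whose protein_id has already been seen earlier in the table
n_repeats <- sum(duplicated(genes$protein_id))

# The IDs that occur more than once, and the full rows they belong to
dup_ids  <- unique(genes$protein_id[duplicated(genes$protein_id)])
dup_rows <- genes[genes$protein_id %in% dup_ids, ]

print(n_repeats)
print(dup_rows)
```

If `n_repeats` is zero on the real gene table, the keys are unique and the extra rows need another explanation.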

Expected output:

       protein_id           Gene            protein_id1    coexpression experiments database
1   9606.ENSP00000000233     ARF5       9606.ENSP00000272298        0          0        0
2   9606.ENSP00000000233     ARF5       9606.ENSP00000253401        0          0        0
3   9606.ENSP00000000233     ARF5       9606.ENSP00000401445        0          0        0
4   9606.ENSP00000000233     ARF5       9606.ENSP00000418915        0          0        0

I follow this up with another merge (renaming the protein_id column of the gene dataset to protein_id1) to get the Gene names for the 'protein_id1' column too. This also gives me the same 11,775,850 rows. Any help in understanding this would be appreciated.
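For the two-step lookup described above, a possible alternative to merging twice is a `match()`-based lookup, sketched here on made-up data (all values, and the column name `Gene1`, are hypothetical). `match()` returns the position of the first matching key only, so the row count and row order of proteindf cannot change. The trade-off is that if a protein ID maps to several genes, only the first one is kept, silently.

```r
# Toy stand-ins for the real data (hypothetical values)
genes <- data.frame(
  protein_id = c("P1", "P2", "P3"),
  Gene       = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)
proteindf <- data.frame(
  protein_id  = c("P1", "P1", "P2"),
  protein_id1 = c("P2", "P3", "P9"),   # "P9" has no entry in genes
  stringsAsFactors = FALSE
)

# Look up the gene name for each column; unmatched IDs become NA
proteindf$Gene  <- genes$Gene[match(proteindf$protein_id,  genes$protein_id)]
proteindf$Gene1 <- genes$Gene[match(proteindf$protein_id1, genes$protein_id)]

print(proteindf)
```

Because no join is performed, this also avoids building a second multi-million-row intermediate table.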
