DN1
DN1

Reputation: 218

How to account for merge/join adding excess rows?

I have a dataset of genes I map to a type of protein ID. I am then trying to find those protein IDs in another 2nd dataset. The 2nd dataset is pretty large at 11759454 rows. I try to find my matching protein IDs with either merge or join for example:

testdf <- join(proteindf, genes)  #or:
testdf <- merge(proteindf, genes, by.all='protein_id' , all.x=TRUE)

These run but the row orders get out of place with the testdf growing in size to a row number of 11775850.

I am not sure how to address this, I have a biology background, and have tried running an sql version of merging but this runs indefinitely without finishing.

I can't provide full data, but generally the datasets look like:

#gene dataset:

     protein_id             Gene
1   9606.ENSP00000378868    A1CF
2   9606.ENSP00000384794    A4GALT
3   9606.ENSP00000324842    AACS
4   9606.ENSP00000000233    ARF5 
#proteindf:
       protein_id                 protein_id1    coexpression experiments database
1   9606.ENSP00000000233    9606.ENSP00000272298        0          0        0
2   9606.ENSP00000000233    9606.ENSP00000253401        0          0        0
3   9606.ENSP00000000233    9606.ENSP00000401445        0          0        0
4   9606.ENSP00000000233    9606.ENSP00000418915        0          0        0

The protein_id rows can be many duplicates which I assume contributes to the problem.

Expected output:

       protein_id           Gene            protein_id1    coexpression experiments database
1   9606.ENSP00000000233     ARF5       9606.ENSP00000272298        0          0        0
2   9606.ENSP00000000233     ARF5       9606.ENSP00000253401        0          0        0
3   9606.ENSP00000000233     ARF5       9606.ENSP00000401445        0          0        0
4   9606.ENSP00000000233     ARF5       9606.ENSP00000418915        0          0        0

I follow this up with creating another dataset using merge (renaming the gene's protein_id to protein_id1) to get the Gene names for the 'protein_id1' column too, this also gives me the same 11775850 rows. Any help in understanding this would be appreciated.

Upvotes: 0

Views: 35

Answers (1)

Oliver
Oliver

Reputation: 8572

It is not clear what the problem is exactly, due to lack of coding example.

First, by.all should likely be by. Second, the reason you're seeing an increased number of observations can only be, that there are several matches for some values in protein_id in gene (eg. the id is not unique in the genes dataset).

We can extract the duplicate rows using testdf[duplicated(testdf$protein_id),] or:

genes$growid <- seq_len(nrow(genes))
proteindf$prowid <- seq_len(nrow(proteindf))
mdf <- merge(proteindf, merge, by = 'protein_id', all.x = TRUE)
gids_dups <- duplicated(mdf$growid)
pids_dups <- duplicated(mdf$prowid)
#Gene duplicate rows
mdf[gids_dups, ] 
#protein duplicate rows (should be the same)
mdf[pids_dups, ] 

The solution to this problem depends on your dataset, if it is indeed a problem.

Upvotes: 1

Related Questions