Reputation: 218
I have a dataset of genes I map to a type of protein ID. I am then trying to find those protein IDs in another 2nd dataset. The 2nd dataset is pretty large at 11759454 rows. I try to find my matching protein IDs with either merge or join for example:
testdf <- join(proteindf, genes) #or:
testdf <- merge(proteindf, genes, by.all='protein_id' , all.x=TRUE)
These run but the row orders get out of place with the testdf
growing in size to a row number of 11775850.
I am not sure how to address this, I have a biology background, and have tried running an sql version of merging but this runs indefinitely without finishing.
I can't provide full data, but generally the datasets look like:
#gene dataset:
protein_id Gene
1 9606.ENSP00000378868 A1CF
2 9606.ENSP00000384794 A4GALT
3 9606.ENSP00000324842 AACS
4 9606.ENSP00000000233 ARF5
#proteindf:
protein_id protein_id1 coexpression experiments database
1 9606.ENSP00000000233 9606.ENSP00000272298 0 0 0
2 9606.ENSP00000000233 9606.ENSP00000253401 0 0 0
3 9606.ENSP00000000233 9606.ENSP00000401445 0 0 0
4 9606.ENSP00000000233 9606.ENSP00000418915 0 0 0
The protein_id rows can be many duplicates which I assume contributes to the problem.
Expected output:
protein_id Gene protein_id1 coexpression experiments database
1 9606.ENSP00000000233 ARF5 9606.ENSP00000272298 0 0 0
2 9606.ENSP00000000233 ARF5 9606.ENSP00000253401 0 0 0
3 9606.ENSP00000000233 ARF5 9606.ENSP00000401445 0 0 0
4 9606.ENSP00000000233 ARF5 9606.ENSP00000418915 0 0 0
I follow this up with creating another dataset using merge (renaming the gene's protein_id to protein_id1) to get the Gene names for the 'protein_id1' column too, this also gives me the same 11775850 rows. Any help in understanding this would be appreciated.
Upvotes: 0
Views: 35
Reputation: 8572
It is not clear what the problem is exactly, due to lack of coding example.
First, by.all
should likely be by
. Second, the reason you're seeing an increased number of observations can only be, that there are several matches for some values in protein_id
in gene
(eg. the id is not unique in the genes dataset).
We can extract the duplicate rows using testdf[duplicated(testdf$protein_id),]
or:
genes$growid <- seq_len(nrow(genes))
proteindf$prowid <- seq_len(nrow(proteindf))
mdf <- merge(proteindf, merge, by = 'protein_id', all.x = TRUE)
gids_dups <- duplicated(mdf$growid)
pids_dups <- duplicated(mdf$prowid)
#Gene duplicate rows
mdf[gids_dups, ]
#protein duplicate rows (should be the same)
mdf[pids_dups, ]
The solution to this problem depends on your dataset, if it is indeed a problem.
Upvotes: 1