Reputation: 11
I am doing some genetics work and I am trying to utilize two data sets, however each one refers to the genes differently. One dataset (lets call it Column A) refers to each gene by its conventional name as well as a specific ID. The other data set uses some combination of those two variables. I'd rather not have to rerun the pipeline to get the names to align, so I was hoping I could match and copy values as per the title. To reiterate, I was hoping that for each value in column B, look in column A for the "best match" and set value to that value.
Below is an example of the columns and my issues
Column A | Column B |
---|---|
gene1-ID001 | gene1 |
gene2-ID002 | ID002 |
gene3-ID003 | gene3-ID003 |
gene4-ID004 | gene4 |
The desired output is as follows:
Column A | Column B |
---|---|
gene1-ID001 | gene1-ID001 |
gene2-ID002 | gene2-ID002 |
gene3-ID003 | gene3-ID003 |
gene4-ID004 | gene4-ID004 |
The names/gene IDs in each data set are tied to different values and are not arranged in a specific order.
Happy to learn about any established tools for these kinds of issues as well. Thanks you.
I'm fairly certain I need to use grep in some manner to check for partial matches before setting values, but the finesse in setting this up is something I'm lacking. Handling issues when there are more than one partial matches and picking the "most matching" is a problem.
Upvotes: 1
Views: 51
Reputation: 79208
You could use agrep
with max distance of 0.
df$C = df[sapply(df[[2]], agrep, df[[1]], max.distance = 0),1]
df
Column_A Column_B C
1 gene1-ID001 gene1 gene1-ID001
2 gene2-ID002 ID002 gene2-ID002
3 gene3-ID003 gene3-ID003 gene3-ID003
4 gene4-ID004 gene4 gene4-ID004
You can use adist
with partial matching:
df$C <- df[max.col(-adist(df[[2]], df[[1]], partial = TRUE)), 1]
df
Column_A Column_B C
1 gene1-ID001 gene1 gene1-ID001
2 gene2-ID002 ID002 gene2-ID002
3 gene3-ID003 gene3-ID003 gene3-ID003
4 gene4-ID004 gene4 gene4-ID004
Upvotes: 0