awsk
awsk

Reputation: 11

R Data Manipulation Help: For each value in column B, look in column A for the "best match" and set value to that value

I am doing some genetics work and I am trying to utilize two data sets, however each one refers to the genes differently. One dataset (lets call it Column A) refers to each gene by its conventional name as well as a specific ID. The other data set uses some combination of those two variables. I'd rather not have to rerun the pipeline to get the names to align, so I was hoping I could match and copy values as per the title. To reiterate, I was hoping that for each value in column B, look in column A for the "best match" and set value to that value.

Below is an example of the columns and my issues

Column A Column B
gene1-ID001 gene1
gene2-ID002 ID002
gene3-ID003 gene3-ID003
gene4-ID004 gene4

The desired output is as follows:

Column A Column B
gene1-ID001 gene1-ID001
gene2-ID002 gene2-ID002
gene3-ID003 gene3-ID003
gene4-ID004 gene4-ID004

The names/gene IDs in each data set are tied to different values and are not arranged in a specific order.

Happy to learn about any established tools for these kinds of issues as well. Thanks you.

I'm fairly certain I need to use grep in some manner to check for partial matches before setting values, but the finesse in setting this up is something I'm lacking. Handling issues when there are more than one partial matches and picking the "most matching" is a problem.

Upvotes: 1

Views: 51

Answers (1)

Onyambu
Onyambu

Reputation: 79208

You could use agrep with max distance of 0.

df$C = df[sapply(df[[2]], agrep, df[[1]], max.distance = 0),1]
df
     Column_A    Column_B           C
1 gene1-ID001       gene1 gene1-ID001
2 gene2-ID002       ID002 gene2-ID002
3 gene3-ID003 gene3-ID003 gene3-ID003
4 gene4-ID004       gene4 gene4-ID004

You can use adist with partial matching:

df$C <- df[max.col(-adist(df[[2]], df[[1]], partial = TRUE)), 1]
df

     Column_A    Column_B           C
1 gene1-ID001       gene1 gene1-ID001
2 gene2-ID002       ID002 gene2-ID002
3 gene3-ID003 gene3-ID003 gene3-ID003
4 gene4-ID004       gene4 gene4-ID004

Upvotes: 0

Related Questions