Praderas

Reputation: 481

Clean the gene names in a dataframe

I have an R dataframe that looks like this:

       Gene Symbol       Prom 1       Prom 2    Prom 3
 1 Gm16088  // Gm16088    7.168819  7.410547  7.634662
 2             Gm26206    7.006416  6.824151  6.941721
 3   Gm1992  // Gm1992    6.750240  6.591182  6.479798
 4             Gm10568    4.390371  4.496734  4.672061
 5             Gm22307   13.196217 13.157953 13.601210
 6 Gm16041  // Gm16041    5.146015  5.450036  5.388205
 7 Gm17101  // Gm17101    6.434086  6.752058  6.603427

In the Gene Symbol column, some cells contain the same gene symbol repeated several times, separated by //. In some rows the symbol is repeated a hundred times. Is there a way to clean this up so that the rows look like this:

Gene Symbol       Prom 1       Prom 2    Prom 3
 1 Gm16088       7.168819  7.410547  7.634662

Instead of having them like this:

Gene Symbol       Prom 1       Prom 2    Prom 3
 1 Gm16088  // Gm16088    7.168819  7.410547  7.634662

Upvotes: 2

Views: 122

Answers (2)

akrun

Reputation: 887128

We could also use word() from the stringr package, which here extracts the first space-delimited word of each string:

library(stringr)
word(x, 1)
#[1] "Gm16088" "Gm26206"

data

 x <- c("Gm16088  // Gm16088", "Gm26206")
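Applied to a whole column, the same idea might look like the sketch below. The data frame df is a hypothetical stand-in built from two rows of the question, and the str_trim() step is an assumption: it guards against leading whitespace in the cell, which would otherwise make the first "word" an empty string.

```r
library(stringr)

# Hypothetical data frame mirroring two rows of the question
df <- data.frame(
  `Gene Symbol` = c("  Gm16088  // Gm16088", "Gm26206"),
  `Prom 1` = c(7.168819, 7.006416),
  check.names = FALSE,       # keep the space in the column names
  stringsAsFactors = FALSE   # keep the symbols as character, not factor
)

# Trim surrounding whitespace first, then keep the first space-delimited word
df$`Gene Symbol` <- word(str_trim(df$`Gene Symbol`), 1)

print(df$`Gene Symbol`)
```

Note that this keeps only the first symbol in each cell, so it is only appropriate when every symbol in the cell is the same.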

Upvotes: 2

Tim Biegeleisen

Reputation: 521279

You could try using gsub():

x <- "Gm16088  // Gm16088"

gsub("\\s*//.*", "", x)
#[1] "Gm16088"

In your actual code, you would replace x with:

df$`Gene Symbol`

where df is the name of the data frame.
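As a minimal sketch of that substitution on a full column, assuming a data frame shaped like the one in the question (df here is a hypothetical reconstruction of three of its rows, and the trimws() call is an optional extra, not part of the original suggestion):

```r
# Hypothetical data frame mirroring three rows of the question
df <- data.frame(
  `Gene Symbol` = c("Gm16088  // Gm16088", "Gm26206", "Gm1992  // Gm1992"),
  `Prom 1` = c(7.168819, 7.006416, 6.750240),
  check.names = FALSE,       # keep the space in the column names
  stringsAsFactors = FALSE   # keep the symbols as character, not factor
)

# Remove everything from the first "//" onward, plus any whitespace before it
df$`Gene Symbol` <- gsub("\\s*//.*", "", df$`Gene Symbol`)

# Optional: strip any stray leading or trailing whitespace that remains
df$`Gene Symbol` <- trimws(df$`Gene Symbol`)

print(df$`Gene Symbol`)
```

Because the pattern matches from the first // to the end of the string, a cell with the symbol repeated a hundred times is reduced to a single symbol in one pass. Cells without // are left unchanged.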

Upvotes: 3
