accibio

Reputation: 547

Replacing elements in a column of a dataframe by using regular expressions in R

df is a test data frame, a subset of my original data frame, which has ~1,000,000 rows and 21 columns.

df <- data.frame(
 Hits = c("# test1", "#  Query: tr|A4I9M8|A4I9M8_LEIIN", "# 13135", "tr|E9BQL4|E9BQL4_LEIDB", 
       "tr|A4I9M8|A4I9M8_LEIIN", "tr|A0A3Q8IUE6|A0A3Q8IUE6_LEIDO", "tr|Q4Q3E9|Q4Q3E9_LEIMA", 
       "tr|A0A640KX53|A0A640KX53_LEITA", "# test2", "# Query: tr|E9AH01|E9AH01_LEIIN", "# 83771", 
       "tr|A0A6L0XNG2|A0A6L0XNG2_LEIIN", "tr|E9AH01|E9AH01_LEIIN", "tr|A0A6J8FCW4|A0A6J8FCW4_LEIDO", 
       "tr|A0A6J8FCW4|A0A6J8FCW4_LEIDO"),
 Category1 = c(NA, NA, NA, 0.001, 0.001, 0.002, 0.003, 0.003, NA, NA, NA, 0.023, 0.341, 0.341, 0.569),
 Category2 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 100, 95, 95, 97),
 Category3 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 98, 97, 97, 92))

df looks something like this

[image: screenshot of df]

In the Hits column, the elements which don't start with a # are to be replaced by the portion lying between the first two occurrences of |. The regular expression which I came up with to extract this portion is

^.*?(\\b[A-Z][^|]*).*

The output should look like this

[image: screenshot of the expected output]

I can't figure out how to replace the elements with the extracted portions. I could use a loop with a condition, but given the size of the original data frame I doubt that would be efficient, since loops tend to be slow in R. Can somebody suggest an alternative, preferably a vectorized solution?

Upvotes: 1

Views: 85

Answers (1)

benson23

Reputation: 19097

You can use gsub() inside mutate() to do the job.

library(tidyverse)

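# my original answer: lookarounds capture the text between the first two "|";
# the leading [^#] leaves rows that start with "#" unchanged
df %>% mutate(Hits = gsub("^[^#].+?((?<=\\|).+?(?=\\|)).*", "\\1", Hits, perl = TRUE))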

Or

# OP's regex, prefixed with [^#] so that rows starting with "#" are left unchanged
df %>% mutate(Hits = gsub("^[^#].*?(\\b[A-Z][^\\|]*).*", "\\1", Hits, perl = TRUE))

Both generate the same output.

Output

# A tibble: 15 x 4
   Hits                             Category1 Category2 Category3
   <chr>                                <dbl>     <dbl>     <dbl>
 1 # test1                             NA            NA        NA
 2 #  Query: tr|A4I9M8|A4I9M8_LEIIN    NA            NA        NA
 3 # 13135                             NA            NA        NA
 4 E9BQL4                               0.001       100       100
 5 A4I9M8                               0.001       100       100
 6 A0A3Q8IUE6                           0.002        99        99
 7 Q4Q3E9                               0.003        98        98
 8 A0A640KX53                           0.003        98        98
 9 # test2                             NA            NA        NA
10 # Query: tr|E9AH01|E9AH01_LEIIN     NA            NA        NA
11 # 83771                             NA            NA        NA
12 A0A6L0XNG2                           0.023       100        98
13 E9AH01                               0.341        95        97
14 A0A6J8FCW4                           0.341        95        97
15 A0A6J8FCW4                           0.569        97        92
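
If you would rather not load the tidyverse, a base R sketch along the same lines might look like this. The pattern and the small `hits` vector here are illustrative assumptions, not taken from your real data. The idea is to let `sub()` work on the whole column at once and to rely on the fact that elements that do not match the pattern come back unchanged.

```r
# Small sample in the same shape as the question's Hits column (illustrative only)
hits <- c("# test1",
          "#  Query: tr|A4I9M8|A4I9M8_LEIIN",
          "tr|E9BQL4|E9BQL4_LEIDB",
          "tr|A0A3Q8IUE6|A0A3Q8IUE6_LEIDO")

# Assumed pattern (not one of the two regexes above):
#   ^[^#|]*   leading text containing neither "#" nor "|", so "#" rows cannot match
#   \\|       the first "|"
#   ([^|]*)   capture everything up to the next "|"
#   \\|.*     the second "|" and the rest of the string
pattern <- "^[^#|]*\\|([^|]*)\\|.*"

# sub() is vectorized over its input; non-matching elements are returned as they are
cleaned <- sub(pattern, "\\1", hits)
print(cleaned)
```

On the question's data frame the same call would be df$Hits <- sub(pattern, "\\1", df$Hits). Because the pattern is anchored at the start and consumes the whole string, one substitution per element is enough, so sub() can stand in for gsub() here.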

Upvotes: 1
