Reputation: 547
df is a test data frame, a subset of my original data frame, which has ~1,000,000 rows and 21 columns.
df <- data.frame(
Hits = c("# test1", "# Query: tr|A4I9M8|A4I9M8_LEIIN", "# 13135", "tr|E9BQL4|E9BQL4_LEIDB",
"tr|A4I9M8|A4I9M8_LEIIN", "tr|A0A3Q8IUE6|A0A3Q8IUE6_LEIDO", "tr|Q4Q3E9|Q4Q3E9_LEIMA",
"tr|A0A640KX53|A0A640KX53_LEITA", "# test2", "# Query: tr|E9AH01|E9AH01_LEIIN", "# 83771",
"tr|A0A6L0XNG2|A0A6L0XNG2_LEIIN", "tr|E9AH01|E9AH01_LEIIN", "tr|A0A6J8FCW4|A0A6J8FCW4_LEIDO",
"tr|A0A6J8FCW4|A0A6J8FCW4_LEIDO"),
Category1 = c(NA, NA, NA, 0.001, 0.001, 0.002, 0.003, 0.003, NA, NA, NA, 0.023, 0.341, 0.341, 0.569),
Category2 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 100, 95, 95, 97),
Category3 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 98, 97, 97, 92))
df looks something like this
In the Hits column, the elements that don't start with a # are to be replaced by the portion lying between the first two occurrences of |. The regular expression I came up with to extract this portion is:
^.*?(\\b[A-Z][^|]*).*
The output should look like this
I can't seem to figure out how to replace the elements with the extracted portions. I can think of using conditional loops in this case. But considering the size of the original dataframe, I'm not sure if that would be an efficient way to deal with this as loops tend to be slower in R. Can somebody suggest an alternative way, preferably a vectorized solution to solve this issue?
Upvotes: 1
Views: 85
Reputation: 19097
You can use gsub() inside mutate() to do the job.
library(tidyverse)
# my original answer
df %>% mutate(Hits = gsub("^[^#].+?((?<=\\|).+?(?=\\|)).*", "\\1", Hits, perl = TRUE))
Or
# OP's regex
df %>% mutate(Hits = gsub("^[^#].*?(\\b[A-Z][^\\|]*).*", "\\1", Hits, perl = TRUE))
Both generate the same output.
# A tibble: 15 x 4
   Hits                            Category1 Category2 Category3
   <chr>                               <dbl>     <dbl>     <dbl>
 1 # test1                                NA        NA        NA
 2 # Query: tr|A4I9M8|A4I9M8_LEIIN        NA        NA        NA
 3 # 13135                                NA        NA        NA
 4 E9BQL4                              0.001       100       100
 5 A4I9M8                              0.001       100       100
 6 A0A3Q8IUE6                          0.002        99        99
 7 Q4Q3E9                              0.003        98        98
 8 A0A640KX53                          0.003        98        98
 9 # test2                                NA        NA        NA
10 # Query: tr|E9AH01|E9AH01_LEIIN        NA        NA        NA
11 # 83771                                NA        NA        NA
12 A0A6L0XNG2                          0.023       100        98
13 E9AH01                              0.341        95        97
14 A0A6J8FCW4                          0.341        95        97
15 A0A6J8FCW4                          0.569        97        92
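If you would rather not load the tidyverse, the same idea can be sketched in base R with logical indexing, which is also vectorized. This is only an illustration, not a benchmarked solution: df_small and is_hit are names made up for this example, the sample values are copied from the question's df, and the pattern is a simpler alternative to the two regexes above that assumes every non-# element contains at least two | characters.

```r
# Small stand-in for the Hits column (values copied from the question's df)
hits <- c("# test1", "# Query: tr|A4I9M8|A4I9M8_LEIIN", "# 13135",
          "tr|E9BQL4|E9BQL4_LEIDB", "tr|A0A3Q8IUE6|A0A3Q8IUE6_LEIDO")
df_small <- data.frame(Hits = hits, stringsAsFactors = FALSE)

# TRUE for the rows that do NOT start with "#"
is_hit <- !startsWith(df_small$Hits, "#")

# Replace only those rows with the text between the first two "|"
df_small$Hits[is_hit] <- sub("^[^|]*\\|([^|]*)\\|.*$", "\\1",
                             df_small$Hits[is_hit])

df_small$Hits
```

Because the # rows are excluded by the index rather than by the regex, the pattern itself only has to describe the "text between the first two |" part.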
Upvotes: 1