Reputation: 1023
I want to extract the text from a column in a dataframe that looks something like this:
genes=TraesCS5A01G391700;is_HC;ANN=A|missense_variant|MODERATE|TraesCS5A01G391700|TraesCS5A01G391700|transcript|TraesCS5A01G391700.1|protein_coding|7/8|c.539C>T|p.Ala180Val|539/735|539/735|180/244||,A|missense_variant|MODERATE|TraesCS5A01G391700|TraesCS5A01G391700|transcript|TraesCS5A01G391700.2|protein_coding|7/7|c.562C>T|p.Arg188Trp|562/621|562/621|188/206||
What I want to get is the first occurrence of the text between two | characters. In this example, that is: missense_variant. I want the results in a list. I was trying something like this:
res_ann <- rm_between(vcf_ann$INFO, "|", "|", extract=TRUE)
str_extract(vcf_ann$INFO, regex(""))
The first attempt returns all of the results between | characters, and for the second I could not come up with a regex that matches.
Upvotes: 1
Views: 300
Reputation: 626932
You may use
str_extract(vcf_ann$INFO, "(?<=\\|)[^|]+(?=\\|)")
or even (if you do not need to check for the trailing |):
str_extract(vcf_ann$INFO, "(?<=\\|)[^|]+")
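As a quick check, the first pattern can be run on a shortened copy of the sample line from the question. This is only a sketch: the two-row `vcf_ann` data frame below is a made-up stand-in for the real data, and `first_field` and `res_ann` are illustrative names. The `as.list()` step is one way to get the results as a list, which is what the question asks for.

```r
library(stringr)

# Hypothetical stand-in for the asker's data; only the INFO column matters here.
vcf_ann <- data.frame(
  INFO = c(
    "genes=TraesCS5A01G391700;is_HC;ANN=A|missense_variant|MODERATE|TraesCS5A01G391700",
    "genes=TraesCS5A01G391700;is_HC"  # a row with no | at all
  ),
  stringsAsFactors = FALSE
)

pattern <- "(?<=\\|)[^|]+(?=\\|)"

# str_extract() is vectorised: one result per row, NA where nothing matches.
first_field <- str_extract(vcf_ann$INFO, pattern)

# Convert the character vector to a list, one element per row.
res_ann <- as.list(first_field)
```

With this input, `first_field` holds `"missense_variant"` for the first row and `NA` for the second.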
Details
str_extract - obtains the first match from the given string
(?<=\\|) - a positive lookbehind that requires the presence of | immediately to the left of the current location
[^|]+ - 1 or more chars other than |
(?=\\|) - a positive lookahead that requires the presence of | immediately to the right of the current location.
Upvotes: 2
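To see what each piece of the pattern contributes, the following sketch compares the variants on two short strings. The strings `x` and `y` and the variable names are made up for illustration; only `str_extract` and the two patterns come from the answer.

```r
library(stringr)

x <- "A|missense_variant|MODERATE"  # field is closed by a trailing |
y <- "A|missense_variant"           # field runs to the end of the string

# Full pattern: the match must have a | on both sides.
both_sides <- str_extract(x, "(?<=\\|)[^|]+(?=\\|)")

# Without the lookbehind, the match starts at the very first non-| run.
no_lookbehind <- str_extract(x, "[^|]+")

# With the lookahead, a field that is not followed by | is rejected.
with_lookahead <- str_extract(y, "(?<=\\|)[^|]+(?=\\|)")

# Without the lookahead, the same field is accepted.
without_lookahead <- str_extract(y, "(?<=\\|)[^|]+")
```

Here `both_sides` is `"missense_variant"`, `no_lookbehind` is `"A"`, `with_lookahead` is `NA`, and `without_lookahead` is `"missense_variant"`. So the shorter pattern is the safer choice when the wanted field might be the last one on the line.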