user1532587

Reputation: 1023

How to extract text between two | in R dataframe with a regex

I want to extract the text from a column in a dataframe that looks something like this:

genes=TraesCS5A01G391700;is_HC;ANN=A|missense_variant|MODERATE|TraesCS5A01G391700|TraesCS5A01G391700|transcript|TraesCS5A01G391700.1|protein_coding|7/8|c.539C>T|p.Ala180Val|539/735|539/735|180/244||,A|missense_variant|MODERATE|TraesCS5A01G391700|TraesCS5A01G391700|transcript|TraesCS5A01G391700.2|protein_coding|7/7|c.562C>T|p.Arg188Trp|562/621|562/621|188/206||

I want the first occurrence of text between two | characters; in this example that is missense_variant. I want the results in a list. I tried the following:

res_ann <- rm_between(vcf_ann$INFO, "|", "|", extract=TRUE)
str_extract(vcf_ann$INFO, regex(""))

The first attempt returns every match between | characters, and for the second I could not come up with a regex that matches.

Upvotes: 1

Views: 300

Answers (1)

Wiktor Stribiżew

Reputation: 626932

You may use

str_extract(vcf_ann$INFO, "(?<=\\|)[^|]+(?=\\|)")

or even (if you do not need to check for the trailing |):

str_extract(vcf_ann$INFO, "(?<=\\|)[^|]+")

Details

  • str_extract obtains the first match from the given string
  • (?<=\\|) - a positive lookbehind that requires the presence of | immediately to the left of the current location
  • [^|]+ - 1 or more chars other than |
  • (?=\\|) - a positive lookahead that requires the presence of | immediately to the right of the current location.
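
For reference, the same pattern can also be tried in base R, in case stringr is not available. This is only a sketch: the info vector below is a hypothetical stand-in for vcf_ann$INFO (shortened from the example in the question), and first_field is an illustrative name. Like str_extract, it keeps one result per input string and uses NA where nothing matches.

```r
# Hypothetical stand-in for vcf_ann$INFO (shortened from the question's example)
info <- c(
  "genes=TraesCS5A01G391700;is_HC;ANN=A|missense_variant|MODERATE|TraesCS5A01G391700",
  "genes=X;ANN=T|synonymous_variant|LOW|X",
  "no_pipes_here"
)

pattern <- "(?<=\\|)[^|]+(?=\\|)"

# regexpr() locates only the first match per string;
# perl = TRUE is needed for the lookbehind/lookahead syntax
m <- regexpr(pattern, info, perl = TRUE)

# regmatches() drops strings without a match, so start from a vector
# of NAs and fill in only the positions that did match
first_field <- rep(NA_character_, length(info))
first_field[m != -1] <- regmatches(info, m)

# The question asks for the results in a list
res_ann <- as.list(first_field)
```

With stringr loaded, as.list(str_extract(vcf_ann$INFO, "(?<=\\|)[^|]+(?=\\|)")) should give the same shape of result in a single call.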

Upvotes: 2
