Reputation: 1112
I have a list of strings, read into a data frame as below:
df = read.table(text="AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393
AC1=190;AD=2,747;AF1=1;BQB=0.0722892;DP=749;DP4=2,0,747,0;FQ=-43.6844
AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942
AC1=47;AD=660,182;AF1=0.24862;BQB=0.680047;DP=1684;DP4=660,0,182,0
AC1=47;AD=659,183;AF1=0.248425;DP=842;DP4=0,659,0,183;FQ=999
AC1=78;AD=23,17;AF1=0.408247;BQB=1;DP=40;DP4=23,0,17,0", header=FALSE, stringsAsFactors=FALSE)
Each element is separated by ";". I would like to extract only the "DP=[0-9]+" part from each string. The expected result is:
DP=1016
DP=749
DP=6
DP=1684
DP=842
DP=40
Any help is appreciated.
Upvotes: 0
Views: 49
Reputation: 38500
Here is one regular expression that will work:
gsub(".*;(DP=[0-9.]+);.*$", "\\1", df$V1)
If the "DP=" field can contain multiple entries separated by commas, as fields like "DP4=" do in the example data, then, as @pierre-lafortune notes in the comments below and in his answer, you may be better off with the [^;] character class, which matches any character except a semicolon:
gsub(".*;(DP=[^;]+);.*$", "\\1", df$V1)
Of course, you could just add the comma to the character class,
gsub(".*;(DP=[0-9.,]+);.*$", "\\1", df$V1)
but there may be other characters you want to keep as well, so [^;] is the most inclusive approach.
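To make the difference between the three character classes concrete, here is a small sketch. The input string is hypothetical (the example data never has a comma inside "DP="), and the variable names are only for illustration.

```r
# Hypothetical record in which DP holds two comma-separated values
# (this does not occur in the question's data).
x <- "AC1=1;DP=10,20;DP4=1,0;FQ=999"

# Digits and dots only: the pattern cannot match past the comma,
# so gsub() finds nothing to replace and returns the input unchanged.
digits_only <- gsub(".*;(DP=[0-9.]+);.*$", "\\1", x)

# Comma added to the class: the whole field is captured.
with_comma <- gsub(".*;(DP=[0-9.,]+);.*$", "\\1", x)

# Anything except a semicolon: also captures the whole field.
not_semicolon <- gsub(".*;(DP=[^;]+);.*$", "\\1", x)

print(digits_only)
print(with_comma)
print(not_semicolon)
```

Note that when the pattern fails to match, gsub silently hands back the original string, so it can be worth checking the result against the input.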
Upvotes: 1
Reputation: 28441
In base R:
gsub(".*((?<=;)DP=[^;]+(?=;)).*", "\\1", df$V1, perl=TRUE)
#[1] "DP=1016" "DP=749" "DP=6" "DP=1684" "DP=842" "DP=40"
I was surprised when the resident genius on regex suggested using packages for text extraction. sub and gsub can get unruly when pulling out a specific string:
library(stringr)
str_extract_all(df$V1, "(?<=;)DP=[^;]+(?=;)")
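Note that str_extract_all returns a list with one element per input string, whereas str_extract returns a character vector holding the first match of each. If you would rather stay in base R, a similar extraction can be sketched with regexpr and regmatches. This is only one possible approach; the vector x below simply repeats the question's data, and no_dp is a hypothetical string added to show what happens when the field is missing.

```r
# The six strings from the question, as a plain character vector.
x <- c(
  "AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393",
  "AC1=190;AD=2,747;AF1=1;BQB=0.0722892;DP=749;DP4=2,0,747,0;FQ=-43.6844",
  "AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942",
  "AC1=47;AD=660,182;AF1=0.24862;BQB=0.680047;DP=1684;DP4=660,0,182,0",
  "AC1=47;AD=659,183;AF1=0.248425;DP=842;DP4=0,659,0,183;FQ=999",
  "AC1=78;AD=23,17;AF1=0.408247;BQB=1;DP=40;DP4=23,0,17,0"
)

pattern <- "(?<=;)DP=[^;]+(?=;)"

# regexpr() locates the first match in each string;
# regmatches() then pulls out the matched text.
dp <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(dp)

# Hypothetical string with no DP field (not in the original data).
no_dp <- "AC1=60;AD=393,115;AF1=0.318816"

# gsub() has nothing to replace, so it returns the input unchanged ...
sub_result <- gsub(".*((?<=;)DP=[^;]+(?=;)).*", "\\1", no_dp, perl = TRUE)

# ... whereas regmatches() returns an empty character vector.
extract_result <- regmatches(no_dp, regexpr(pattern, no_dp, perl = TRUE))
```

One caveat with this form: regmatches drops non-matching strings, so the result can be shorter than the input. Also, because the pattern requires a ";" on both sides, a "DP=" field at the very start or end of a string would not be matched.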
Upvotes: 2