Reputation: 15
I need to extract a specific number from strings in a vector that look like this:
V1 V2 info
XX YY AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633
VV ZZ AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366
I only want to extract the number(s) from "CD=XXX" (note that there is also an "AA_CD=XXXX" in every row, which should be ignored).
I currently have:
df$info <- as.numeric(gsub("^.*;CD=[0-9, ],?|;.*$", "", df$info))
This grabs the number after "CD=" only in rows where that field holds a single number; rows where it holds several comma-separated numbers are not handled.
I need the result to include those rows as well, so that the extracted values look like this:
0.5555
0.4444,0.78463
0.0123
0.34,0.54,0.765
I know it is probably a silly mistake I am making...Thanks in advance!!!
Upvotes: 1
Views: 312
Reputation: 19716
Here is an approach:
lapply(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ","), as.numeric)
gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec) #extracts the numbers
#output
[1] "0.5555"         "0.4444,0.78463"
These strings are then split at "," with strsplit, producing a list, and lapply applies as.numeric to each list element.
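The steps above can be checked with a small self-contained sketch. Here vec is reconstructed from the two sample rows in the question. The pattern with [^;]* is an assumption, not part of the approach above: it captures everything up to the next semicolon, so values that do not start with "0." would also be kept. It still assumes CD is never the first field, since it relies on the leading ";" to skip "AA_CD=".

```r
# Sample data reconstructed from the two rows shown in the question
vec <- c("AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633",
         "AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366")

# Capture everything between ";CD=" and the next ";".
# The leading ";" keeps "AA_CD=" from matching.
# Note: a string without ";CD=" is returned unchanged by sub().
cd_chr <- sub("^.*;CD=([^;]*).*$", "\\1", vec)

# Split at commas and convert each piece to numeric,
# giving one numeric vector per element of vec
cd_list <- lapply(strsplit(cd_chr, ",", fixed = TRUE), as.numeric)
cd_list
```

Using sub with a single capture group avoids relying on the second alternative (";.*$") to clean up the tail of the string.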
If you do not need to keep track of which vector element each number came from:
as.numeric(unlist(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ",")))
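If a flat result is preferred but the origin of each number still matters, one option is to carry an index alongside the values. This is a minimal sketch using the same gsub pattern as above; vec is reconstructed from the sample rows in the question, and the names cd_list and out are only illustrative.

```r
# Sample data reconstructed from the two rows shown in the question
vec <- c("AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633",
         "AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366")

# Same extraction as above: one numeric vector per element of vec
cd_list <- lapply(
  strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ","),
  as.numeric
)

# Flatten the values, but record which element of vec each one came from
out <- data.frame(
  row = rep(seq_along(cd_list), lengths(cd_list)),
  CD  = unlist(cd_list)
)
out
```

The row column can then be used to join the values back to the original data frame if needed.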
Upvotes: 1