ProgrammerOliv
ProgrammerOliv

Reputation: 29

extract every element that meets a pattern in a string in R

I have a string, basically it's a SQL statement. I want to extract some part of it. Here is the code

 SELECT 
 DTE as "Date",
 CURRENT_DATE AS "Day",
 concat( BCCO, BCBCH ) AS "client/batch",
 BCSTAT as "Batch Status",
 CASE 
  WHEN EXC = 'MCR' THEN CNT 
  ELSE 0 
 END AS "MCR-NPR",
 CASE 
  WHEN EXC = 'NRC' THEN CNT 
  ELSE 0 
 END AS "NRC-NPR",
 CASE 
  WHEN EXC = 'OFD' THEN CNT 
  ELSE 0 
 END AS "OFD-NPR",
 CASE 
  WHEN EXC = 'TDB' THEN CNT 
  ELSE 0 
 END AS "TDB-NPR",
 CASE 
  WHEN EXC = 'TDC' THEN CNT 
  ELSE 0 
 END AS "TDC-NPR",
 CASE 
  WHEN EXC = 'UDC' THEN CNT 
  ELSE 0 
 END AS "UDC-NPR",
 CASE 
  WHEN EXC = 'BIN' THEN CNT 
  ELSE 0 
 END AS "BIN-WRN",
 CASE 
  WHEN EXC = 'DSP' THEN CNT 
  ELSE 0 
 END AS "DSP-WRN",

I want to extract every element between END AS and the quote. A vector like ("MCR-NPR",...,"DSP-WRN") will be the desire output.

I know I may need to use regular expression, but I couldn't extract every one of them.

Any idea will be appreciated.

Best,

Upvotes: 1

Views: 90

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 270248

1) grep/read.table grep out lines with END AS and use read.table with a sep of double quote to read those. The second column will be the desired data. No regular expressions or packages are used.

read.table(text = grep("END AS", s, value = TRUE, fixed = TRUE), 
  sep = '"', as.is = TRUE)[[2]]
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"

1a) This is similar to (1) but uses sub with a regular expression instead of read.table:

sub('.*END AS "(.+)".*', "\\1", grep("END AS", s, value = TRUE))
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"

2) strapply Another approach is the following. It makes use of the fact that the desired strings follow END AS and are surrounded with double quotes It has the shortest code of the ones shown here.

library(gsubfn)
unlist(strapplyc(s, 'END AS "(.+)"'))
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"

3) strcapture Another base R approach using the same pattern as in (2) is:

na.omit(strcapture('END AS "(.+)"', s, list(value = character(0))))

giving:

     value
9  MCR-NPR
13 NRC-NPR
17 OFD-NPR
21 TDB-NPR
25 TDC-NPR
29 UDC-NPR
33 BIN-WRN
37 DSP-WRN

Note

The input s in reproducible form:

s <- 
c("SELECT ", " DTE as \"Date\",", " CURRENT_DATE AS \"Day\",", 
" concat( BCCO, BCBCH ) AS \"client/batch\",", " BCSTAT as \"Batch Status\",", 
" CASE ", "  WHEN EXC = 'MCR' THEN CNT ", "  ELSE 0 ", " END AS \"MCR-NPR\",", 
" CASE ", "  WHEN EXC = 'NRC' THEN CNT ", "  ELSE 0 ", " END AS \"NRC-NPR\",", 
" CASE ", "  WHEN EXC = 'OFD' THEN CNT ", "  ELSE 0 ", " END AS \"OFD-NPR\",", 
" CASE ", "  WHEN EXC = 'TDB' THEN CNT ", "  ELSE 0 ", " END AS \"TDB-NPR\",", 
" CASE ", "  WHEN EXC = 'TDC' THEN CNT ", "  ELSE 0 ", " END AS \"TDC-NPR\",", 
" CASE ", "  WHEN EXC = 'UDC' THEN CNT ", "  ELSE 0 ", " END AS \"UDC-NPR\",", 
" CASE ", "  WHEN EXC = 'BIN' THEN CNT ", "  ELSE 0 ", " END AS \"BIN-WRN\",", 
" CASE ", "  WHEN EXC = 'DSP' THEN CNT ", "  ELSE 0 ", " END AS \"DSP-WRN\"")

Upvotes: 2

Related Questions