Get the characters after a certain pattern in R - regex

Question

I have a data frame with one column:

df <- data.frame(cat = c("c(\\\"BPT\\\", \"BP\")", "c(\"BP2\", \"BP\")", "c(\"BPT\", \"BP\")", "c(\"CN\", \"NC\")"))
df$cat <- as.character(df$cat)
df$cat

How can I extract the characters that appear after c("? Sometimes there is only one backslash and sometimes there are two. The number of characters also varies: sometimes there are 2 and sometimes 3, e.g. BP, BP2.

So far I have tried:

substr(x = df$cat, start = 4, stop = 6)

But this results in:

 "\"BP" "BP2"  "BPT"  "CN\""

And I only want the output to show

"BPT" "BP2"  "BPT"  "CN"

Wiktor Stribiżew · Accepted Answer

You may use

df <- data.frame(cat = c("c(\\\"BPT\\\", \"BP\")", "c(\"BP2\", \"BP\")", "c(\"BPT\", \"BP\")", "c(\"CN\", \"NC\")"))
df$cat <- as.character(df$cat)
unlist(lapply(gsub('\\', '', df$cat, fixed=TRUE), function(x) eval(parse(text=x))[[1]]))
## => [1] "BPT" "BP2" "BPT" "CN"

See the R demo online.
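
The one-liner above can be unpacked into separate steps to see what each call contributes. This is only a sketch: the names cat_strings, cleaned, parsed and first_items are made up for illustration, and vapply is swapped in for unlist(lapply(...)) so the result type is checked.

```r
# Same strings as in the question; the first element contains literal backslashes.
cat_strings <- c("c(\\\"BPT\\\", \"BP\")", "c(\"BP2\", \"BP\")",
                 "c(\"BPT\", \"BP\")", "c(\"CN\", \"NC\")")

# Step 1: drop every literal backslash (fixed = TRUE, so no regex is involved).
cleaned <- gsub("\\", "", cat_strings, fixed = TRUE)

# Step 2: each cleaned string is now R source for a character vector;
# parse() turns it into an expression and eval() runs it.
parsed <- lapply(cleaned, function(x) eval(parse(text = x)))

# Step 3: keep the first element of each vector.
first_items <- vapply(parsed, function(v) v[[1]], character(1))
print(first_items)
```

After step 1 the first element reads c("BPT", "BP"), the same shape as the other three, which is why step 2 can evaluate all of them the same way.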

Notes

gsub('\\', '', df$cat, fixed=TRUE) removes all backslashes. You may use gsub('\\"', '"', df$cat, fixed=TRUE) if you only plan to remove backslashes before ".
eval(parse(text=x))[[1]] parses each string as R code, evaluates it to a character vector, and returns the first item.
lapply applies the function to every element of df$cat. See Using sapply and lapply.
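
Because eval(parse()) executes whatever the column contains, a regex-only variant may be preferable when the input is not trusted. The following is a hedged sketch, not part of the answer above: the pattern and the names cat_strings and first_items are assumptions for illustration, and it assumes every string starts with c( followed by optional backslashes and a double quote.

```r
# Same strings as in the question; the first element contains literal backslashes.
cat_strings <- c("c(\\\"BPT\\\", \"BP\")", "c(\"BP2\", \"BP\")",
                 "c(\"BPT\", \"BP\")", "c(\"CN\", \"NC\")")

# Regex (after R string unescaping): ^c\(\\*"([^"\\]+).*$
#   ^c\(       literal 'c(' at the start
#   \\*        zero or more literal backslashes
#   "          the opening double quote
#   ([^"\\]+)  capture everything up to the next quote or backslash
#   .*$        consume the rest so sub() replaces the whole string
first_items <- sub('^c\\(\\\\*"([^"\\\\]+).*$', '\\1', cat_strings, perl = TRUE)
print(first_items)
```

Note that sub() returns a string unchanged when the pattern does not match, so malformed rows would pass through as-is rather than raise an error.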
