Regex using back references in R

Question

I wrote a regex at https://regex101.com/r/R8ObNk/1, (^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*), with a back reference to capture group 5, or "\5".

For some reason, when I use the regex above in R with gsub, it does not return the correct data.

Here is the dput for first line of the data that I am trying to back reference:

structure(list(value = "19-22\t\t4\tP,G\tDOB_TT\t\tTime of Birth\t\t126\t \t0000-2359 Time of Birth"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

This is the gsub on the line above: gsub(pattern = "(^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*)", replacement = "\5", x = a$value). I do know you're supposed to add another "\" when working with regex in R, but that still didn't work.

The intended result of the gsub is "DOB_TT", the 5th capture group.

MrFlick · Accepted Answer

You need to be careful with escape characters. Note that R uses an extra "\" in strings that will not be understood by the website. And when you see a string like

x <- "a\tb"

in R, there is no literal backslash in the string. The \t is the escape for a tab character. So nchar(x) returns 3, not 4, because those two characters together make one tab character. So given your data, what you really want is

gsub(pattern = "(^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*)",
  replacement = "\\5", x = a$value)

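You do not need an extra \ for the tabs because tab characters aren't special in a regular expression. They are just regular characters. The back reference in the replacement does need the doubled backslash ("\\5"), because a bare "\5" in an R string is an octal escape for a control character, not a backslash followed by 5.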
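As a rough check of the points above, here is a minimal sketch in base R. It assumes the first line of the data is typed with explicit \t escapes; the variable names value, pattern and result are only illustrative and not from the question.

```r
# First line of the data, written with explicit \t escapes
# (assumed equivalent to a$value in the question).
value <- "19-22\t\t4\tP,G\tDOB_TT\t\tTime of Birth\t\t126\t \t0000-2359 Time of Birth"

# "\t" in R source is one tab character; "\\t" is a backslash plus "t".
len_tab     <- nchar("a\tb")    # 3
len_escaped <- nchar("a\\tb")   # 4

# "\5" is an octal escape (character code 5), while "\\5" is the
# two-character string that gsub reads as "capture group 5".
code_bare  <- utf8ToInt("\5")   # 5
len_backref <- nchar("\\5")     # 2

# The tabs go into the pattern as-is; only the replacement needs "\\".
pattern <- "(^[^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)(.*)"
result  <- gsub(pattern = pattern, replacement = "\\5", x = value)
print(result)                   # "DOB_TT"
```

The second field is empty here (two tabs follow "19-22"), which is why the groups use * rather than +; with + the match would fail on this line.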
