jrara

Reputation: 17001

R: Extract data from string using POSIX regular expression

How can I extract only DATABASE_NAME from this string using POSIX-style regular expressions?

st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."

First of all, this generates an error:

Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"

I was thinking of something like:

sub(".*\\.", st, "")

Upvotes: 2

Views: 1530

Answers (3)

Andrie

Reputation: 179468

Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.

However, if you really want to use a regex and gsub, this solution replaces the first two occurrences of the pattern (a string followed by a period) with an empty string, and then strips the remaining trailing period.

Note the use of the ? modifier to make the match non-greedy, as well as the {2} quantifier to repeat the group in parentheses exactly two times.

gsub("\\.", "", gsub("(.+?\\.){2}", "", st)) 
[1] "DATABASE_NAME"
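
To see what the ? modifier changes, here is a small sketch comparing the lazy pattern above with a greedy variant. The variable names lazy and greedy are only illustrative, and the sketch assumes R's default regex engine, as in the call above.

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Lazy: each repetition of the group stops at the nearest period, so only
# the first two period-terminated chunks are removed.
lazy <- gsub("(.+?\\.){2}", "", st)

# Greedy: the two repetitions together consume the whole string,
# so nothing is left over.
greedy <- gsub("(.+\\.){2}", "", st)

print(lazy)
print(greedy)
```

The lazy result still carries the trailing period, which is why the outer gsub("\\.", "", ...) is needed.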

Upvotes: 2

Andrie

Reputation: 179468

An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:

st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

library(stringr)

str_split(st, "\\.")[[1]][3]

[1] "DATABASE_NAME"

Upvotes: 1

Gavin Simpson

Reputation: 174853

The first problem is that you need to escape the \ in your string:

st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
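
As a quick check that the doubled backslash in the source ends up as a single character in the string, something like the following sketch may help. The names n_backslash and st_raw are illustrative, and the raw-string literal assumes R 4.0.0 or later.

```r
# In R source code, "\\" denotes one literal backslash.
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# writeLines() shows the string as stored, without escape sequences.
writeLines(st)

# Count the backslashes actually present (fixed = TRUE: no regex involved).
n_backslash <- lengths(regmatches(st, gregexpr("\\", st, fixed = TRUE)))
print(n_backslash)

# Assumption: R >= 4.0.0, where raw strings avoid the escaping altogether.
st_raw <- r"(MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME.)"
print(identical(st, st_raw))
```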

As for the main problem, this will return the bit you want from the string you gave:

> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"

But a simpler solution would be to split on the literal period (the regex \\.) and select the third chunk, which here is also the last one:

> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"

or, slightly more automated:

> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
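
If the same extraction is needed for several strings, the split-and-take-last idea could be wrapped up roughly as follows. This is only a sketch: last_field is a hypothetical helper name, not a base R function, and it assumes the separator should be treated as a fixed string rather than a regex.

```r
# Hypothetical helper: return the last non-empty field of each string.
last_field <- function(x, sep = ".") {
  # fixed = TRUE treats sep literally, so "." needs no escaping.
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    p <- p[nzchar(p)]                      # drop empty fields
    if (length(p) == 0) NA_character_ else p[length(p)]
  }, character(1))
}

st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
print(last_field(st))
```

Using fixed = TRUE sidesteps the \\. escaping entirely, at the cost of no longer being a regular-expression solution.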

Upvotes: 3
