frogatto

Reputation: 29285

String split using RegEx in R

Assume we've got the following string.

str <- '<a><b><c>';

I need to split it so that the output is a vector of 'a', 'b', 'c'.

Essentially, I probably need a RegEx split function that takes instances of <(*)> out of the original string and adds them to a new vector.

Upvotes: 2

Views: 159

Answers (5)

G. Grothendieck

Reputation: 269461

1) strsplit/gsub Remove the < characters and then split by > like this. (One might have expected that this would leave a zero character component at the end but fortunately because of the way strsplit works this does not occur.) This solution is particularly short and uses no packages.

unlist(strsplit(gsub("<", "", str), ">"))
## [1] "a" "b" "c"
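The remark about the missing empty component can be checked directly. The following is a small sketch, with made-up input strings and variable names (`lead`, `trail`), showing that strsplit keeps an empty string for a match at the start of the input but not for a match at the end:

```r
# A separator at the start yields a leading empty string ...
lead <- strsplit(">a>b", ">")[[1]]
print(lead)
## [1] ""  "a" "b"

# ... but a separator at the end does not yield a trailing one.
trail <- strsplit("a>b>", ">")[[1]]
print(trail)
## [1] "a" "b"
```

This is why removing the < characters first, rather than the > characters, avoids any clean-up step.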

2) scan/chartr Replace the < and > characters with spaces and then use scan to read in what is left. This solution uses no packages and is particularly straightforward, but it depends on the fields not containing spaces:

scan(textConnection(chartr("<>", "  ", str)), what = "", quiet = TRUE)
## [1] "a" "b" "c"

3) strapplyc strapplyc in the gsubfn package extracts the fields that match a regular expression. (The stringr package provides a similar function, and base R's regmatches can do this too, though a bit awkwardly.) This solution is very short but does use a package.

library(gsubfn)

strapplyc(str, "[^<>]+", simplify = c)
## [1] "a" "b" "c"
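For comparison, the base R regmatches route mentioned above might look roughly like this. It is only a sketch: the variable names `x` and `fields` are illustrative, and the pattern is the same one used with strapplyc.

```r
# Base R only: gregexpr locates every run of characters other than '<' or '>',
# and regmatches pulls those runs out of the string.
x <- '<a><b><c>'   # same value as `str` in the question
fields <- regmatches(x, gregexpr("[^<>]+", x))[[1]]
print(fields)
## [1] "a" "b" "c"
```

The `[[1]]` is needed because regmatches returns a list with one element per input string.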

Upvotes: 2

Giuseppe Ricupero

Reputation: 6272

You can split with strsplit on the regex [<>]+ and then filter out the empty strings with lapply:

str <- '<ab><bc><cd>'
unlist(lapply(strsplit(str, "[<>]+"), function(x) x[x != ""]))
# [1] "ab" "bc" "cd"

Or simply drop the first (empty) element:

unlist(strsplit(str, "[<>]+"))[-1]
# [1] "ab" "bc" "cd"

Upvotes: 1

akrun

Reputation: 886948

We can use str_extract_all

library(stringr)
str2 <- '<ab><bc><cd>'  # input with multi-character fields
str_extract_all(str2, '[a-z]+')[[1]]  # [a-z]+ matches lowercase letters only
#[1] "ab" "bc" "cd"

Upvotes: 1

Jaap

Reputation: 83215

str <- '<a><b><c>'
str <- gsub('<|>','',str)
str <- unlist(strsplit(str,'',fixed=TRUE))  # or: strsplit(str,'',fixed=TRUE)[[1]]

gives:

> str
[1] "a" "b" "c"

In response to your comment:

str2 <- '<ab><bc><cd>'
str2 <- unlist(strsplit(str2,'><',fixed=TRUE))  # or: strsplit(str2,'><',fixed=TRUE)[[1]]
str2 <- gsub('<|>','',str2)

gives:

> str2
[1] "ab" "bc" "cd"

Upvotes: 4

Pierre Lapointe

Reputation: 16277

First, use gsub to replace '><' with something else. I chose a space. This is what you will strsplit on later. Then remove the remaining '>' and '<'. You can then strsplit on the space. Use unlist if needed.

str1 <- '<a><b><c>'
str1 <- gsub('><', ' ', str1)   # '<a b c>'
str1 <- gsub('>|<', '', str1)   # 'a b c'
strsplit(str1, ' ')             # returns a list of length 1
# [[1]]
# [1] "a" "b" "c"

Upvotes: 1
