Bastien

Reputation: 3098

Ignore part of a string when splitting using regular expression in R

I'm trying to split a string in R (using strsplit) at specific points (dashes, -), but not when the dash is inside a substring enclosed in square brackets ([...]).

Example:

xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
xx
  [1] "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
  [2] "Total Internet-Time Spent Online-Past 7 Days" 

should give me something like:

list(c("Radio Stations","Listened to Past Week","Toronto [FM-CFXJ-93.5 (93.5 The Move)]"), c("Total Internet","Time Spent Online","Past 7 Days"))
  [[1]]
  [1] "Radio Stations"                         "Listened to Past Week"                 
  [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"

  [[2]]
  [1] "Total Internet"    "Time Spent Online" "Past 7 Days"  

Is there a way to do this with a regular expression? The position and number of dashes vary between elements of the vector, and brackets are not always present. However, when there are brackets, they are always at the end.

I've tried different things, but none are working:

## Trying to match "-" before "[" in Perl
strsplit(xx, split = "-(?=\\[)", perl=T)
# does nothing

## trying to first extract what follows "[" and then split what precedes it
temp <- strsplit(xx, "[", fixed = T)
temp <- lapply(temp, function(yy) substr(head(yy, -1),"-"))
# doesn't work as there are some elements with no brackets...

Any help would be appreciated.

Upvotes: 3

Views: 1995

Answers (3)

G. Grothendieck

Reputation: 269654

1) gsubfn Assuming square brackets are balanced and not nested, and that the input contains no exclamation marks, gsubfn locates each [...] and within those uses gsub to convert dashes to exclamation marks. We then split what is left on the remaining dashes and convert the exclamation marks back to dashes.

The regular expression means match a [ followed by the shortest string until the next ].

library(gsubfn)

s <- strsplit(gsubfn("\\[.*?\\]", ~ gsub("-", "!", x), xx), "-")
lapply(s, gsub, pattern = "!", replacement = "-")

which could be expressed using a magrittr pipeline:

library(gsubfn)
library(magrittr)

xx %>%
   gsubfn(pattern = "\\[.*?\\]", replacement = ~ gsub("-", "!", x)) %>%
   strsplit("-") %>%
   lapply(gsub, pattern = "!", replacement = "-")
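The same placeholder idea can also be sketched in base R with regmatches<- instead of gsubfn. This is only an illustration under the same assumptions (balanced, non-nested brackets and no "!" in the data); the names m, protected, parts and result are made up for this example.

```r
xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]",
        "Total Internet-Time Spent Online-Past 7 Days")

# locate every [...] group (no nesting assumed)
m <- gregexpr("\\[[^]]*\\]", xx)

# inside each group, hide the dashes behind the "!" placeholder
protected <- xx
regmatches(protected, m) <- lapply(regmatches(xx, m), gsub,
                                   pattern = "-", replacement = "!", fixed = TRUE)

# split on the dashes that are left, then restore the hidden ones
parts <- strsplit(protected, "-", fixed = TRUE)
result <- lapply(parts, gsub, pattern = "!", replacement = "-", fixed = TRUE)
result
```

Elements without brackets simply have nothing to protect and are split on every dash.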

2) readLines This alternative uses no packages, does not use strsplit and uses only simple fixed regular expressions. It also assumes balanced non-nested square brackets.

Using gsub, it first inserts a newline before each [ and after each ]. Then, for each input string, it reads the result into r and, for the odd-positioned lines (those outside brackets), replaces each dash with a newline. Finally, it pastes r back together and re-reads it, which has the effect of splitting it at the newlines (which were previously dashes).

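lapply(gsub("\\]", "]\n", gsub("\\[", "\n[", xx)), function(x) {
   # each [...] group now sits on its own (even-numbered) line
   r <- readLines(textConnection(x))
   # odd-numbered lines are the text outside brackets
   i  <- seq(1, length(r), 2)
   r[i] <- gsub("-", "\n", r[i])
   # re-reading the pasted text splits it at the former dashes
   readLines(textConnection(paste(r, collapse = "")))
})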

Upvotes: 1

Wiktor Stribiżew

Reputation: 626893

To match a - that is not inside [ and ], you must match any part of the string that is enclosed in [ and ] and skip it, and match - in all other contexts. In abc-def], the - is not between a [ and a ], so according to the specs it should still be split on.

It is done with this regex:

\[[^][]*](*SKIP)(*FAIL)|-

Here,

  • \[ - matches a [
  • [^][]* - zero or more chars other than [ and ] (if you use [^]] it will match any char but ])
  • ] - a literal ]
  • (*SKIP)(*FAIL) - PCRE verbs that discard the matched text and make the engine resume searching after the end of the discarded match
  • | - or
  • - - a hyphen in other contexts.

Or, to also skip [...[...] like substrings (an extra [ inside the brackets):

\[[^]]*](*SKIP)(*FAIL)|-

Or, to account for nested square brackets:

(\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-

Here, (\[(?:[^][]++|(?1))*]) matches and captures a [, then zero or more repetitions of either 1+ chars other than [ and ] (with [^][]++) or (|) a recursion of the whole capturing group 1 pattern with (?1) (the whole part between (...)), and then the closing ].

See the R demo:

xx <- c("abc-def]", "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"
strsplit(xx, pattern, perl=TRUE)
# [[1]]
# [1] "abc"  "def]"
# [[2]]
# [1] "Radio Stations"                        
# [2] "Listened to Past Week"                 
# [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
# [[3]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"      

pattern_recursive <- "(\\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-"
xx2 <- c("Radio Stations-Listened to Past Week-Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
strsplit(xx2, pattern_recursive, perl=TRUE)
# [[1]]
# [1] "Radio Stations"                            
# [2] "Listened to Past Week"                     
# [3] "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"

# [[2]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"   
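
To see which dashes the first (non-recursive) pattern selects as split points, one can inspect the match positions with gregexpr. This is just an illustrative sketch; the names all_dashes, split_points and pos are not part of the original demo.

```r
x <- "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"

# every dash in the string, bracketed or not
all_dashes <- gregexpr("-", x, fixed = TRUE)[[1]]

# dashes the pattern actually matches, i.e. the split points
split_points <- gregexpr(pattern, x, perl = TRUE)[[1]]
pos <- as.integer(split_points)

length(all_dashes)      # 4: two outside the brackets, two inside
length(pos)             # 2: only the dashes outside the brackets
substring(x, pos, pos)  # the matched characters themselves
```

The bracketed part is consumed by the first alternative and then discarded, so the dashes inside it are never offered to the second alternative.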

Upvotes: 2

talat

Reputation: 70266

Based on: Regex for matching a character, but not when it's enclosed in square bracket

You can use:

strsplit(xx, "-(?![^\\[]*\\])", perl = TRUE)
[[1]]
[1] "Radio Stations"                         "Listened to Past Week"                 
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"

[[2]]
[1] "Total Internet"    "Time Spent Online" "Past 7 Days" 
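
The negative lookahead (?![^\\[]*\\]) rejects a dash whenever a ] can be reached from it without passing a [ first, which is how it decides that a dash sits inside brackets. A small sketch of how that plays out, including the abc-def] edge case from the other answer (the inputs and the names res_balanced and res_unmatched are only for illustration):

```r
pattern <- "-(?![^\\[]*\\])"

# a dash inside a balanced [...] is followed by "]" before any "[", so it is kept
res_balanced <- strsplit("a-b [c-d] e-f", pattern, perl = TRUE)[[1]]
res_balanced
# "a"  "b [c-d] e"  "f"

# an unmatched "]" protects the dashes before it as well
res_unmatched <- strsplit("abc-def]", pattern, perl = TRUE)[[1]]
res_unmatched
# "abc-def]"
```

So this pattern and the (*SKIP)(*FAIL) one agree on balanced input but differ when a ] appears without an opening [.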

Upvotes: 3
