Chris Ruehlemann

Reputation: 21400

Regex in R to match strings in square brackets

I have transcripts of storytellings with many instances of overlapped speech indicated by square brackets wrapped around the speech in overlap. I want to extract these instances of overlap. In the following mock example,

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")

this code works fine:

pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value=T) 
matches <- gregexpr(pattern, ovl) 
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap); overlap_clean
[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

But in a larger file, a dataframe, it doesn't. Is this due to a mistake in the pattern or could it be due to how the dataframe is structured? The first six lines of the df look like this:

> head(df)
                                                             Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2                                             June:\tI know he is.
3                                 Kar:\tblack welding glasses on, 
4                        \tand he turned round and he made me jump
5                                                 \t“O:h, Colin”, 
6                                  \tand then (                  )

Upvotes: 4

Views: 4270

Answers (2)

Wiktor Stribiżew

Reputation: 626728

To match strings between [ and ] with no square brackets in between use

"\\[[^][]*]"

It will match [a] in the string [a[a], whereas the \[.*?] pattern would match the whole of [a[a].

Details

  • \[ - a [ char
  • [^][]* - a negated bracket expression (or character class) that matches any 0 or more chars other than [ and ]
  • ] - a ] char (there is no need to escape it outside of a character class/bracket expression)
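
As a quick check of the difference described above, here is a minimal sketch that runs both patterns against the [a[a] example. The variable names x, lazy and negated are made up for illustration.

```r
# Test string with an unbalanced opening bracket before the real pair
x <- "[a[a] string"

# Lazy dot: starts at the first "[" and stops at the first "]"
lazy <- unlist(regmatches(x, gregexpr("\\[.*?]", x)))

# Negated bracket expression: cannot cross another "[" or "]"
negated <- unlist(regmatches(x, gregexpr("\\[[^][]*]", x)))

lazy     # "[a[a]"
negated  # "[a]"
```

The lazy pattern still has to begin at the leftmost [, so it swallows the stray bracket; the negated bracket expression only accepts the innermost pair.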

See the R demo online:

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(regmatches(ovl, gregexpr("\\[[^][]*]", ovl)))
## => [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

With stringr::str_extract_all:

library(stringr)
ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(str_extract_all(ovl, "\\[[^\\]\\[]*]"))
## => [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

Here, because stringr handles the pattern with the ICU regex library, both square brackets inside the bracket expression must be escaped.
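
For the data frame in the question, the same base R call can be pointed at the Story column. The sketch below uses a small made-up data frame (the rows are invented, only the column name Story comes from the question) and converts the column with as.character() first. That conversion is an assumption: the post does not say whether the column is a factor, but regmatches() expects a character vector, so it is one thing worth ruling out.

```r
# Hypothetical stand-in for the transcript data frame; only the column
# name "Story" is taken from the question, the rows are invented.
df <- data.frame(
  Story = c("Kar:\twell [yes right]",
            "June:\tI know he is.",
            "\toh [  we::ll] i do [n't] know"),
  stringsAsFactors = FALSE
)

# Make sure we work on a plain character vector (guards against factors)
story <- as.character(df$Story)

# Extract every bracketed span from every row
overlap <- unlist(regmatches(story, gregexpr("\\[[^][]*]", story)))
overlap
# [1] "[yes right]" "[  we::ll]"  "[n't]"
```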

Upvotes: 1

Tim Biegeleisen

Reputation: 520958

Though it might be working in certain cases, your pattern looks off to me. I think it should be this:

pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean

[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

This matches and captures a bracketed term, using the Perl-style lazy dot (.*?) so that each match stops at the first closing bracket.
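
To see why the lazy quantifier matters, here is a small sketch comparing it with a greedy version on a line that contains two bracketed spans. The test string and the names greedy and lazy are invented for illustration.

```r
# One line with two separate overlaps
x <- "a [b] c [d]"

# Greedy: .* runs to the LAST "]" on the line, merging both spans
greedy <- unlist(regmatches(x, gregexpr("(\\[.*\\])", x)))

# Lazy: .*? stops at the first "]" after each "["
lazy <- unlist(regmatches(x, gregexpr("(\\[.*?\\])", x)))

greedy  # "[b] c [d]"
lazy    # "[b]" "[d]"
```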

Upvotes: 5
