Reputation: 21400
I have transcripts of storytellings with many instances of overlapped speech indicated by square brackets wrapped around the speech in overlap. I want to extract these instances of overlap. In the following mock example,
ovl <- c("well [yes right]", "let's go", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
this code works fine:
pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value = TRUE)
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap); overlap_clean
[1] "[yes right]" "[ we::ll]" "[°well right° ]"
But in a larger file, a dataframe, it doesn't. Is this due to a mistake in the pattern or could it be due to how the dataframe is structured? The first six lines of the df look like this:
> head(df)
Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2 June:\tI know he is.
3 Kar:\tblack welding glasses on,
4 \tand he turned round and he made me jump
5 \t“O:h, Colin”,
6 \tand then ( )
Upvotes: 4
Views: 4270
Reputation: 626728
To match substrings between [ and ] that contain no square brackets in between, use
"\\[[^][]*]"
It will match [a] in the string [a[a], unlike the \[.*?] pattern.
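A minimal sketch of that difference, using the [a[a] string mentioned above as the only input. The variable names are made up for illustration, and both calls use base R's default regex engine.

```r
# Test string with an unmatched opening bracket, as in the text above
x <- "[a[a]"

# Negated bracket expression: the match cannot cross another "[",
# so only the innermost bracket pair is returned
negated <- unlist(regmatches(x, gregexpr("\\[[^][]*]", x)))

# Lazy dot: the match starts at the first "[" and runs to the first "]",
# so the stray inner "[" is swallowed
lazy <- unlist(regmatches(x, gregexpr("\\[.*?]", x)))

print(negated)  # "[a]"
print(lazy)     # "[a[a]"
```

For transcripts where every opening bracket has its own closing bracket, both patterns should give the same result; they only diverge on input like the above.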
Details
\[ - a [ char
[^][]* - a negated bracket expression (or character class) that matches any 0 or more chars other than [ and ]
] - a ] char (there is no need to escape it outside of a character class/bracket expression)
See the R demo online:
ovl <- c("well [yes right]", "let's go", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(regmatches(ovl, gregexpr("\\[[^][]*]", ovl)))
## => [1] "[yes right]" "[ we::ll]" "[°well right° ]"
With stringr::str_extract_all:
library(stringr)
ovl <- c("well [yes right]", "let's go", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(str_extract_all(ovl, "\\[[^\\]\\[]*]"))
## => [1] "[yes right]" "[ we::ll]" "[°well right° ]"
Here, because the pattern is handled by the ICU regex library, both square brackets inside the character class must be escaped.
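Returning to the data frame from the question: the same base R call can be pointed at the Story column. This is only a sketch. The question does not show how df was built, so the mock data frame below (reusing the example strings) and the as.character() coercion, which guards against the column having been read in as a factor, are assumptions.

```r
# Mock data frame reusing the example strings; the column name Story follows the question
ovl <- c("well [yes right]", "let's go", "oh [ we::ll] i do n't (0.5) know",
         "erm [°well right° ]", "(3.2)")
df <- data.frame(Story = ovl, stringsAsFactors = FALSE)

# Coerce defensively in case the column is a factor rather than character (assumption)
story <- as.character(df$Story)

# Extract every bracketed stretch from every row, then flatten to one vector
overlap <- unlist(regmatches(story, gregexpr("\\[[^][]*]", story)))
print(overlap)
```

Note that the pattern is applied to each row separately, so an overlap that opens with [ on one row and closes with ] on a later row would not be matched by this approach.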
Upvotes: 1
Reputation: 520958
Though it might be working in certain cases, your pattern looks off to me. I think it should be this:
pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean
[1] "[yes right]" "[ we::ll]" "[°well right° ]"
This would match and capture a bracketed term, using the Perl lazy dot to make sure we stop at the first closing bracket.
Upvotes: 5