Reputation: 1217
Let's say arg is the following:
\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n
I'm trying to use the code below to extract all strings in arg that start with "\t|\n|\r" + several capital letters and end with "\r\n\r\n", but am getting no matches:
str_extract_all(arg, "(\t|\n|\r)[A-Z]{1}.*?[A-Z]{2}(\r\n\t\t\t).*?(?=(\r\n\r\n))")
I'd expect the results for this code to be "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n" and "\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n".
When I leave off the positive look-ahead at the end, the matching otherwise works and I get "\tLUKE\r\n\t\t\t" and "\tLANDO\r\n\t\t\t" as expected.
str_extract_all(arg, "(\t|\n|\r)[A-Z]{1}.*?[A-Z]{2}(\r\n\t\t\t).*?")
What am I missing here?
Upvotes: 1
Views: 175
Reputation: 163207
You can omit the capture groups if you don't need their values afterwards. Also, {1} is superfluous and can be removed.
A pattern that ends in .*? will not consume any characters at that point, because the quantifier is non-greedy and nothing after it forces it to expand.
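The behaviour of a trailing lazy quantifier can be checked on a toy string. This is only a small sketch; the string "abc123" and the variable names are made up for illustration and are not part of the question.

```r
library(stringr)

x <- "abc123"

# Nothing follows the lazy quantifier, so it is satisfied by zero characters
# and only the fixed prefix is returned.
lazy_end <- str_extract(x, "abc.*?")

# Once something has to match after it, the lazy quantifier expands
# just far enough for the rest of the pattern to succeed.
lazy_mid <- str_extract(x, "abc.*?3")

print(lazy_end)  # "abc"
print(lazy_mid)  # "abc123"
```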
To keep the pattern less strict, you can use quantifiers instead of specifying the exact number of tabs and newlines.
To prevent unnecessary backtracking, you could match the line that consists only of uppercase chars, followed by matching all lines that do not.
^[^\S\r\n]+[A-Z]+(?:\r?\n(?![^\S\r\n]*[A-Z]+$).*)*
^  Start of the string (or of each line when (?m) is set)
[^\S\r\n]+  Match 1+ whitespace chars other than newlines
[A-Z]+  Match 1+ uppercase chars
(?:  Non-capture group
\r?\n(?![^\S\r\n]*[A-Z]+$)  Match a newline and assert that the next line does not consist of only a single uppercase word
.*  If the assertion holds, match the whole line
)*  Close the group and repeat 0+ times to match all lines

Example using (?m) for multiline:
library(stringr)
arg <- "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n"
str_extract_all(arg, "(?m)^[^\\S\\r\\n]+[A-Z]+(?:\\r?\\n(?![^\\S\\r\\n]*[A-Z]+$).*)*")
Output
[[1]]
[1] "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n"
[2] "\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n"
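To see why the (?m) flag matters here, the same pattern can be run with and without it. This is a sketch on the string from the question; the names pat, with_m and without_m are just illustrative. Without (?m), ^ only anchors at the very start of the input and $ only near its very end, so the look-ahead never recognises the LANDO line as a new speaker and everything is swallowed by one match.

```r
library(stringr)

arg <- "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n"

# The pattern from above, without any inline flag
pat <- "^[^\\S\\r\\n]+[A-Z]+(?:\\r?\\n(?![^\\S\\r\\n]*[A-Z]+$).*)*"

# Multiline mode: ^ and $ work per line, so each speaker starts a new match
with_m <- str_extract_all(arg, paste0("(?m)", pat))[[1]]

# Default mode: the anchors only apply to the input as a whole
without_m <- str_extract_all(arg, pat)[[1]]

length(with_m)     # 2
length(without_m)  # 1
```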
Upvotes: 3
Reputation: 4658
By default, the dot (.) does not match newlines (see the dotall option in help(stri_opts_regex), for example), which is why the .*? part does not capture what you want. You can enable this via the (?s) flag:
str_extract_all(arg, "(?s)(\t|\n|\r)[A-Z]{1,}(\r\n\t\t\t).*?(?=\r\n\r\n)")
[[1]]
[1] "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando"
[2] "\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou."
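Note that the look-ahead only asserts the trailing "\r\n\r\n" and does not include it in the match. If the blank line should be part of the result, as in the expected output in the question, one possible variation is to consume it instead. The sketch below also uses stringr's regex() helper with dotall = TRUE, which should be equivalent to the inline (?s) flag; the simplified pattern (a single leading tab, no capture groups) is my own assumption about what is needed.

```r
library(stringr)

arg <- "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter, we'll contact \r\n\t\tyou.\r\n\r\n"

# dotall = TRUE lets . match \r and \n; the final \r\n\r\n is consumed
# rather than only asserted, so it ends up in the extracted string.
res <- str_extract_all(
  arg,
  regex("\t[A-Z]+\r\n\t\t\t.*?\r\n\r\n", dotall = TRUE)
)[[1]]

length(res)  # 2
res[1]       # "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck, Lando\r\n\r\n"
```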
Upvotes: 3