Reputation: 2257
Suppose I have a few lines out of Wikipedia XML that look like this:
[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]
I want to remove the part that begins with "[[Image:" and is closed by "observances]]".
There could be several other lines of text that have brackets as well, and I don't want to do a greedy search, otherwise it may accidentally remove those other brackets too.
For example, if I just did a greedy \\[\\[Image:.*\\]\\], I believe it would remove everything up to the last closing brackets (the ones after Errico Malatesta).
Is there a regular expression that can make this easier for me?
Upvotes: 2
Views: 217
Reputation: 15502
Using the following test string (note, I added an additional [[image:foobar[[foo [baz] bar]]foobar]] in there):
[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of[[image:foobar[[foo [baz] bar]]foobar]] Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]
And a regular expression pattern of:
(?i)\\[\\[image:(?:\\[\\[(?:(?!(?:\\[\\[|]])).)*]]|(?:(?!(?:\\[\\[|]])).)*?)*?]]
testString.replaceAll(<above pattern>, "")
will return:
In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]
Here's a more detailed explanation of the regular expression:
(?i)                  # Case insensitive flag
\[\[image:            # Match literal characters '[[image:'
(?:                   # Begin non-capturing group
  \[\[                # Match literal characters '[['
  (?:                 # Begin non-capturing group
    (?!               # Begin non-capturing negative look-ahead group
      (?:             # Begin non-capturing group
        \[\[          # Match literal characters '[['
        |             # Match previous atom or next atom
        ]]            # Match literal characters ']]'
      )               # End non-capturing group
    )                 # End non-capturing negative look-ahead group
    .                 # Match any character
  )                   # End non-capturing group
  *                   # Match previous atom zero or more times
  ]]                  # Match literal characters ']]'
  |                   # Match previous atom or next atom
  (?:                 # Begin non-capturing group
    (?!               # Begin non-capturing negative look-ahead group
      (?:             # Begin non-capturing group
        \[\[          # Match literal characters '[['
        |             # Match previous atom or next atom
        ]]            # Match literal characters ']]'
      )               # End non-capturing group
    )                 # End non-capturing negative look-ahead group
    .                 # Match any character
  )                   # End non-capturing group
  *?                  # Reluctantly match previous atom zero or more times
)                     # End non-capturing group
*?                    # Reluctantly match previous atom zero or more times
]]                    # Match literal characters ']]'
This will only handle one level of nested [[...]] patterns. As noted in an answer to the question that TJR linked in the comments above, regular expressions like this one will not handle unlimited nesting. So this pattern will not match something like [[foo[[baz]]bar]] within a [[image:...]] string.
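If deeper nesting matters, one way around the limit is to skip the regex and count bracket depth by hand. The following is only a minimal sketch under that assumption, not a replacement for a real wikitext parser; `NestedImageStripper`, `stripImages` and `findClose` are names made up for this example.

```java
class NestedImageStripper {
    // Marker that opens an image block; compared case-insensitively below.
    private static final String OPEN = "[[Image:";

    /** Removes every balanced [[Image:...]] block, however deep the [[...]] nesting inside it. */
    static String stripImages(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.regionMatches(true, i, OPEN, 0, OPEN.length())) {
                int end = findClose(text, i);
                if (end >= 0) {
                    i = end;          // balanced block found: skip it entirely
                    continue;
                }
                // Unbalanced block: fall through and keep the text as it is.
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    /** Returns the index just past the "]]" that balances the "[[" at start, or -1 if there is none. */
    static int findClose(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("[[", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("]]", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;                  // single brackets like [baz] are ordinary characters
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the test string above (an assumption for illustration).
        String sample = "[[Image:a [[b [[c]] d]] e]] In 1907, the [[Congress of"
                + "[[image:x[[y [z] w]]v]] Amsterdam]] met [[Errico]]";
        System.out.println(stripImages(sample));
    }
}
```

The depth counter treats only doubled brackets as delimiters, so it copes with two or more levels of [[...]] inside an image block, and it leaves the input untouched when an image block is never closed.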
For a great regular expressions reference, see Regular-Expressions.info.
Upvotes: 0
Reputation: 76
Maybe like this:
(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])
I tried
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class My {
    public static void main(String[] args) {
        String foo = "[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]";
        // Lazily scan to a [[...]] pair that is followed by bracket-free text and a closing ]]
        Matcher m = Pattern.compile("(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])").matcher(foo);
        while (m.find()) {
            // Print each matched span, i.e. the text that would be removed
            System.out.print(m.group(1));
        }
    }
}
And it prints
[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]
Hope this helps :D
Upvotes: 0
Reputation: 424983
This works:
str.replaceAll("^\\[\\[([^\\[]*?(\\[\\[[^\\]]*\\]\\])?[^\\[]*?)*?\\]\\]\\s*", "");
Output from your input:
In 1907, the [[International...
This works because it's looking for matching pairs of [[ and ]] (and surrounding text) inside the first such pair.
Upvotes: 0
Reputation: 3773
What about this example?
s.replaceAll("(\\[{2}Image:(?:(?:\\[{2}).*\\]{2}|[^\\[])*\\]{2})", "");
Would replace this text only:
[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]
Upvotes: 0
Reputation: 5786
Let's see... what about using lazy repetition instead of greedy?
\[\[Image:.*?observances\]\]
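To make the greedy/lazy difference concrete, here is a small Java sketch. The sample string is a shortened stand-in for the question's text, and the class and method names (`LazyVsGreedy`, `greedy`, `lazyToFirstClose`, `lazyToObservances`) are made up for illustration. Note that the lazy version only stops in the right place because the literal word "observances" is part of the pattern; a lazy match up to a bare ]] would stop at the first nested link instead.

```java
class LazyVsGreedy {
    // Shortened stand-in for the question's sample line (an assumption for illustration).
    static final String SAMPLE =
            "[[Image:x.jpg|thumb|by [[Walter Crane]] observances]] In 1907, [[Errico Malatesta]]";

    // Greedy: .* runs to the end of the line and backs up to the LAST "]]".
    static String greedy(String s) {
        return s.replaceAll("\\[\\[Image:.*\\]\\]", "");
    }

    // Lazy without an anchor word: .*? stops at the FIRST "]]", which closes a nested link.
    static String lazyToFirstClose(String s) {
        return s.replaceAll("\\[\\[Image:.*?\\]\\]", "");
    }

    // Lazy with the literal "observances" before "]]", as suggested above.
    static String lazyToObservances(String s) {
        return s.replaceAll("\\[\\[Image:.*?observances\\]\\]", "");
    }

    public static void main(String[] args) {
        // Brackets around each result make leading spaces and empty strings visible.
        System.out.println("greedy:            <" + greedy(SAMPLE) + ">");
        System.out.println("lazy to first ]]:  <" + lazyToFirstClose(SAMPLE) + ">");
        System.out.println("lazy to anchor:    <" + lazyToObservances(SAMPLE) + ">");
    }
}
```

On this sample the greedy pattern removes the whole line, the unanchored lazy pattern leaves " observances]]" behind, and the anchored lazy pattern removes just the image block. The anchored form depends on knowing the closing text in advance, so it does not generalize to other image captions.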
Upvotes: 2