Dan Q

Reputation: 2257

What regex in Java can capture and remove this pattern?

Suppose I have a few lines out of Wikipedia XML that look like this:

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

I want to remove the part that begins with "[[Image:" and ends with "observances]]". Other lines of text may contain brackets as well, and I don't want a greedy search, since it may accidentally remove those other brackets too.

For example, if I just used a greedy \\[\\[Image:.*\\]\\], I believe it would remove everything up to the last closing brackets (after Errico Malatesta).

Is there a regular expression that can make this easier for me?

Upvotes: 2

Views: 217

Answers (5)

Go Dan

Reputation: 15502

Using the following test string (note, I added an additional [[image:foobar[[foo [baz] bar]]foobar]] in there):

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of[[image:foobar[[foo [baz] bar]]foobar]] Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

And a regular expression pattern of:

(?i)\\[\\[image:(?:\\[\\[(?:(?!(?:\\[\\[|]])).)*]]|(?:(?!(?:\\[\\[|]])).)*?)*?]]

testString.replaceAll(<above pattern>, "") will return:

 In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

Here's a more detailed explanation of the regular expression:

(?i)                    # Case insensitive flag
\[\[image:              # Match literal characters '[[image:'
(?:                     # Begin non-capturing group
  \[\[                  # Match literal characters '[['
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *                     # Match previous atom zero or more times
  ]]                    # Match literal characters ']]'
  |                     # Match previous atom or next atom
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *?                    # Reluctantly match previous atom zero or more times
)                       # End non-capturing group
*?                      # Reluctantly match previous atom zero or more times
]]                      # Match literal characters ']]'

This will only handle one level of nested [[...]] patterns. As noted in an answer to the question that TJR linked in the comments above, regular expressions cannot handle arbitrarily deep nesting. So this regular expression pattern will not match something like [[foo[[baz]]bar]] within a [[image:...]] string.
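If deeper nesting has to be supported, one option is to drop the regex and count brackets by hand. The following is only a minimal sketch of that idea, not part of the pattern above: the class name ImageLinkStripper and its methods are made up for illustration, and it assumes the block always starts with "[[Image:" (in any letter case) and that an unbalanced block should be left untouched.

```java
final class ImageLinkStripper {

    private static final String PREFIX = "[[Image:";

    // Removes every [[Image:...]] block, however deeply [[...]] is nested inside it.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Case-insensitive check for the prefix at position i.
            if (text.regionMatches(true, i, PREFIX, 0, PREFIX.length())) {
                int end = findMatchingClose(text, i);
                if (end >= 0) {
                    i = end; // skip the whole balanced block
                    continue;
                }
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    // Returns the index just past the ]] that balances the [[ at 'start',
    // or -1 if the brackets never balance.
    private static int findMatchingClose(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("[[", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("]]", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "[[Image:a.jpg|by [[B]] and [[C [[D]] E]] end]] tail [[F]]";
        System.out.println(strip(sample)); // prints " tail [[F]]"
    }
}
```

Because the scanner tracks depth with a counter, it is not limited to one level of nesting the way the pattern is.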

For a great regular expressions reference, see Regular-Expressions.info.

Upvotes: 0

gwokae

Reputation: 76

Maybe like this:

(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])

I tried

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class My {

    public static void main(String[] args) {
        String foo = "[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]";
        // Lazily advance to the first inner [[...]] pair that is followed by a ]] with no '[' in between.
        Matcher m = Pattern.compile("(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])").matcher(foo);
        while (m.find()) {
            System.out.print(m.group(1));
        }
    }
}

And it prints

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]

Hope this helps :D

Upvotes: 0

Bohemian

Reputation: 424983

This works:

str.replaceAll("^\\[\\[([^\\[]*?(\\[\\[[^\\]]*\\]\\])?[^\\[]*?)*?\\]\\]\\s*", "");

Output from your input:

In 1907, the [[International...

This works because it's looking for matching pairs of [[ and ]] (and surrounding text) inside the first such pair.

Upvotes: 0

TJR

Reputation: 3773

What about this example?

s.replaceAll("(\\[{2}Image:(?:(?:\\[{2}).*\\]{2}|[^\\[])*\\]{2})", "");

Would replace this text only:

  • [[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]

Upvotes: 0

aleph_null

Reputation: 5786

Let's see... what about using lazy repetition instead of greedy?

\[\[Image:.*?observances\]\]
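To see how the quantifier choice changes the result, here is a small sketch comparing three variants with String.replaceAll. The class name LazyVsGreedyDemo and the shortened input string are assumptions made for illustration. Note that the lazy pattern above relies on knowing the caption ends with "observances"; without that anchor, a lazy match stops at the first inner ]].

```java
final class LazyVsGreedyDemo {

    // Shortened stand-in for the Wikipedia line in the question (an assumption).
    static final String INPUT =
        "[[Image:x.jpg|thumb|by [[Walter Crane]] for [[May Day]] observances]]"
        + " In 1907, [[Errico Malatesta]]";

    // Greedy: .* runs to the last ]] in the input.
    static String removeGreedy(String s) {
        return s.replaceAll("\\[\\[Image:.*\\]\\]", "");
    }

    // Lazy without an anchor: .*? stops at the first ]] it can reach.
    static String removeLazy(String s) {
        return s.replaceAll("\\[\\[Image:.*?\\]\\]", "");
    }

    // Lazy with the known trailing word: stops at the first "observances]]".
    static String removeLazyAnchored(String s) {
        return s.replaceAll("\\[\\[Image:.*?observances\\]\\]", "");
    }

    public static void main(String[] args) {
        System.out.println("greedy:        '" + removeGreedy(INPUT) + "'");
        System.out.println("lazy:          '" + removeLazy(INPUT) + "'");
        System.out.println("lazy anchored: '" + removeLazyAnchored(INPUT) + "'");
    }
}
```

On this input the greedy version removes the whole string, the unanchored lazy version cuts off after "Walter Crane]]", and only the anchored lazy version leaves " In 1907, [[Errico Malatesta]]".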

Upvotes: 2
