ahemberg

Reputation: 27

Java regex match wikipedia link

I'm trying to write a Java regular expression to extract text from Wikipedia links, but I'm coming up short.

In essence I want to extract <article alias> from [[Some Article|<article alias>]]. The sequence [[<Any article>|<any alias>]] will show up an unknown number of times in any given string.

Basically I'm looking for a regular expression to put in <regexp here>:

final String someRandomText = "Some random text about [[Roman Empire|the romans]]";
final String replaced = someRandomText.replaceAll("<regexp here>", "$1");

Any ideas?

Upvotes: 0

Views: 220

Answers (2)

Arvind Kumar Avinash

Reputation: 79115

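Using the regex \[\[[^|]*\|(.*)\]\], you can retrieve the alias as group(1) of each match. Note that (.*) is greedy, so if a line contains several links it captures everything up to the last ]] on that line; a reluctant (.*?) stops at the first ]] instead.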

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("\\[\\[[^|]*\\|(.*)\\]\\]")
                .matcher("Some random text about [[Roman Empire|the romans]]");
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

Output:

the romans
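
Since the question says the link can appear any number of times in a string, here is a small sketch comparing the greedy group with a reluctant one on a line that holds two links. The class name AliasExtractor, the helper aliases and the sample text are made up for illustration; only the two regexes come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AliasExtractor {
    // Hypothetical helper: collects group(1) of every match of regex in text.
    static List<String> aliases(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input with two links on one line.
        String text = "About [[Roman Empire|the romans]] and [[Ancient Greece|the greeks]]";

        // Greedy (.*): a single match whose group runs up to the last ]].
        System.out.println(String.join(" / ", aliases("\\[\\[[^|]*\\|(.*)\\]\\]", text)));

        // Reluctant (.*?): one match per link, each group ends at the first ]].
        System.out.println(String.join(" / ", aliases("\\[\\[[^|]*\\|(.*?)\\]\\]", text)));
    }
}
```

The first line printed is "the romans]] and [[Ancient Greece|the greeks", the second is "the romans / the greeks".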

Upvotes: 4

The fourth bird

Reputation: 163372

If you want to match that format, you might also use a capture group.

\[\[[^|]+\|(.+?)]]
  • \[\[ Match [[
  • [^|]+\| Match 1+ times any char except |, then match the |
  • (.+?) Capture group 1, match as few chars as possible
  • ]] Match ]]

Example code

String regex = "\\[\\[[^|]+\\|(.+?)]]";
String string = "Some random text about [[Roman Empire|the romans]] test Some random text about [[Another Empire|the romans 2]]";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Output

the romans
the romans 2

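If you want to use replaceAll for the given example, you can match as few chars as possible up to the link pattern, and then replace the whole match with group 1 using $1. Because the leading .*? is part of the match, the text in front of the link is dropped as well.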

final String someRandomText = "Some random text about [[Roman Empire|the romans]]";
final String replaced = someRandomText.replaceAll(".*?\\[\\[[^|]+\\|(.+?)]]", "$1");
System.out.println(replaced);

Output

the romans
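
If the goal is instead to keep the surrounding text and only swap each link for its alias, one option is the same pattern without the leading .*?. This is a sketch under that assumption, not necessarily what the question intended; the class name AliasReplacer, the helper replaceLinks and the sample text are invented for illustration.

```java
class AliasReplacer {
    // Hypothetical helper: replaces every [[article|alias]] with its alias
    // and leaves all text outside the links untouched.
    static String replaceLinks(String text) {
        return text.replaceAll("\\[\\[[^|]+\\|(.+?)]]", "$1");
    }

    public static void main(String[] args) {
        // Assumed sample input with two links and some trailing text.
        String text = "Some random text about [[Roman Empire|the romans]] "
                + "and [[Another Empire|the romans 2]].";

        // Without the leading .*? only the links themselves are replaced.
        System.out.println(replaceLinks(text));

        // With the leading .*? the text in front of each link is consumed too.
        System.out.println(text.replaceAll(".*?\\[\\[[^|]+\\|(.+?)]]", "$1"));
    }
}
```

The first call prints "Some random text about the romans and the romans 2.", the second prints "the romansthe romans 2.".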

Upvotes: 4
