Reputation: 27
I'm trying to write a Java regular expression to extract text from Wikipedia links, but I'm coming up short.
In essence, I want to extract <article alias> from [[Some Article|<article alias>]]. The sequence [[<any article>|<any alias>]] will show up an unknown number of times in any given string.
Basically, I'm looking for a regular expression to put in <regexp here>:
final String someRandomText = "Some random text about [[Roman Empire|the romans]]";
final String replaced = someRandomText.replaceAll("<regexp here>", "$1");
Any ideas?
Upvotes: 0
Views: 220
Reputation: 79115
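Using the regex \[\[[^|]*\|(.*)\]\], you can retrieve group(1) from each match. Note that (.*) is greedy: if a line contains several links, it captures everything up to the last ]], so use the lazy (.*?) in that case.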
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("\\[\\[[^|]*\\|(.*)\\]\\]")
                .matcher("Some random text about [[Roman Empire|the romans]]");
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
Output:
the romans
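Since the question says the link can appear any number of times in one string, here is a small sketch comparing the greedy group with a lazy one on an input containing two links. The class name GreedyVsLazy, the helper aliases and the sample sentence are made up for illustration; they are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: collects group(1) of every match of regex in input.
    static List<String> aliases(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "[[Roman Empire|the romans]] fought [[Carthage|the Carthaginians]]";

        // Greedy (.*) runs to the last "]]" on the line, so both links end up in one match.
        System.out.println(aliases("\\[\\[[^|]*\\|(.*)\\]\\]", text));
        // prints [the romans]] fought [[Carthage|the Carthaginians]

        // Lazy (.*?) stops at the first "]]", giving one match per link.
        System.out.println(aliases("\\[\\[[^|]*\\|(.*?)\\]\\]", text));
        // prints [the romans, the Carthaginians]
    }
}
```

With a single link per string, as in the demo, both variants give the same result.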
Upvotes: 4
Reputation: 163372
If you want to match that format, you might also use a capture group.
\[\[[^|]+\|(.+?)]]
\[\[ : Match [[
[^|]+\| : Match 1+ times any char except |, then match the |
(.+?) : Capture group 1, match as few chars as possible
]] : Match ]]
Example code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "\\[\\[[^|]+\\|(.+?)]]";
String string = "Some random text about [[Roman Empire|the romans]] test Some random text about [[Another Empire|the romans 2]]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
// Print the alias (group 1) of every link found in the string
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output
the romans
the romans 2
If you want to use replaceAll for the given example, you can also lazily match everything up to the link, and then replace the whole match with group 1 using $1. Note that this discards the text before each link as well:
final String someRandomText = "Some random text about [[Roman Empire|the romans]]";
final String replaced = someRandomText.replaceAll(".*?\\[\\[[^|]+\\|(.+?)]]", "$1");
System.out.println(replaced);
Output
the romans
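If the goal is instead to keep the surrounding text and only swap each link for its alias, a variant without the leading .*? might look like the sketch below. It assumes that neither the article name nor the alias contains a ] character, and it uses negated character classes so a match cannot run across a plain link such as [[Rome]]. The class name ReplaceLinksWithAlias, the constant LINK and the method replaceLinks are made up for illustration.

```java
class ReplaceLinksWithAlias {
    // Assumption: the article name contains no '|' or ']', and the alias contains no ']'.
    // This keeps every match inside a single [[...]] pair.
    static final String LINK = "\\[\\[[^|\\]]+\\|([^\\]]+)]]";

    // Replaces each [[article|alias]] with just the alias and leaves all other text alone.
    static String replaceLinks(String text) {
        return text.replaceAll(LINK, "$1");
    }

    public static void main(String[] args) {
        System.out.println(replaceLinks("Some random text about [[Roman Empire|the romans]]"));
        // prints Some random text about the romans

        System.out.println(replaceLinks(
                "[[Rome]] fell; [[Roman Empire|the romans]] and [[Carthage|its rival]] fought."));
        // prints [[Rome]] fell; the romans and its rival fought.
    }
}
```

Links without an alias are left untouched here; whether they should also be unwrapped depends on what the output is used for.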
Upvotes: 4