Glogo

Reputation: 2884

Regex match Wikipedia internal article links

I want to match text in Wikipedia article source code with a regex that follows these rules:


  1. Match only links to internal articles. Do not match links in any namespace such as files, categories, users, etc. (complete list of these namespaces here)
    • Example link to match [[Without|namespace]]
    • Example links NOT to match [[Category:Nope]], [[File:Nopeish]] etc.

  2. Match only links that contain the delimiter "|". Such links are displayed in the article with text that differs from the title of the article they refer to
    • Example link to match [[Something|else]]
    • Example link NOT to match [[text]]

  3. Capture each link in two groups
    • Example link to match [[Something|else]] will be captured as two groups with text:
      Group 1: "Something"
      Group 2: "else"

So far I've come up with the following regex: \[\[(?!.+?:)(.+?)\|(.+?)\]\] It does not work as expected, because it also matches text like this:

[[Problem]] non link text [[Another link|problemAgain]]
  ^------------ group 1 (wrong) -------^ ^-group 2 -^

[[This should be|matched|]]


Thanks

Upvotes: 3

Views: 2099

Answers (1)

Avinash Raj

Reputation: 174736

Use a negated character class instead of .+? so that neither group can run past a closing ]] into the next link:

\[\[(?!.+?:)([^\]\[]+)\|([^\]\[]+)\]\]

As a Java string literal, this would be:

"\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]"
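To see the difference on the sample text from the question, here is a minimal sketch that runs both patterns and prints group 1 of the first match. The class name LazyVsNegated and the helper firstTarget are made up for illustration; only the two patterns come from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsNegated {
    // Pattern from the question: the lazy dots may cross "]]" boundaries.
    static final Pattern LAZY =
        Pattern.compile("\\[\\[(?!.+?:)(.+?)\\|(.+?)\\]\\]");

    // Pattern from this answer: square brackets are excluded from both groups.
    static final Pattern NEGATED =
        Pattern.compile("\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]");

    // Hypothetical helper: group 1 of the first match, or null if none.
    static String firstTarget(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "[[Problem]] non link text [[Another link|problemAgain]]";
        // prints: Problem]] non link text [[Another link
        System.out.println(firstTarget(LAZY, text));
        // prints: Another link
        System.out.println(firstTarget(NEGATED, text));
    }
}
```

One thing to keep in mind: the (?!.+?:) lookahead scans the rest of the line, so a colon anywhere later on the same line as the [[ blocks the match, even when the colon is outside the link.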


OR

you could drop the lookahead and exclude the colon inside both character classes:

\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]

As a Java string literal, this would be:

"\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]"
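A rough sketch of how this second pattern could be used to collect every piped link as a (target, label) pair. The class WikiLinkExtractor and its extract method are illustrative names, not part of any library, and the sample string is assembled from the examples in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkExtractor {
    // Second pattern from this answer: no brackets or colons in either group.
    static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]");

    // Hypothetical helper: returns {target, label} for each piped internal link.
    static List<String[]> extract(String wikitext) {
        List<String[]> links = new ArrayList<>();
        Matcher m = PIPED_LINK.matcher(wikitext);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "[[Category:Nope]] [[text]] [[Something|else]] "
                      + "[[File:Nopeish|thumb]] [[Another link|problemAgain]]";
        // prints: Something -> else
        //         Another link -> problemAgain
        for (String[] link : extract(sample)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Because the colon is excluded from both groups, a link whose display text contains a colon, such as [[Foo|bar: baz]], is skipped as well. Whether that is acceptable depends on the articles being processed.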


Upvotes: 3
