BrianFreud

Reputation: 7454

Regexp assistance needed parsing MediaWiki template with JavaScript

I'm processing MediaWiki markup with JavaScript and trying to remove certain template parameters. I'm having trouble matching exactly the text, and only the text, that I want to remove.

Simplified down, the template text can look something like this:

{{TemplateX
| a =
Foo bar
Blah blah

Fizbin foo[[domain:blah]]

Ipsum lorem[[domain:blah]]
|b =1
|c = 0fillertext
|d = 1alphabet
| e =
| f = 10: One Hobbit
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
| j = Level 4 [[domain:filk|Songs]]
| k =7 fizbin, 8 [[domain:trekkies|Shatners]]
|l = 
|m = 
}}

The best I've come up with so far is

/\|\s?(a|b|d|f|j|k|m)([^][^\n\|])+/gm

Updated version:

/\|\s?(a|b|d|f|j|k|m)(?:[^\n\|]|[.\n])+/gm

which gives (with the updated regexp):

{{TemplateX


|c = 0fillertext

| e =

| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000

|Songs]]

|Shatners]]
|l = 

But what I'm trying to get is:

{{TemplateX
|c = 0fillertext
| e =
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
|l = 
}}

I can deal with the extraneous newlines, but I still need to make sure that '|Songs]]' and '|Shatners]]' are also matched by the regexp.

Regarding Tgr's comment below,

For my purposes, it is safe to assume that every parameter starts on a new line, where | is the first character on the line, and that no parameter definition includes a | that isn't within a [[foo|bar]] construct. So '\n|' is a safe "start" and "stop" sequence. So the question boils down to, for any given params (a,b,d,f,j,k, and m in the question), I need a regex that matches 'wanted param' in the following:

| [other param 1] = ... 
| [wanted param] = possibly multiple lines and |s that aren't after a newline
| [other param 2]

Upvotes: 0

Views: 140

Answers (3)

ignacio

Reputation: 1

I came up with a pretty ugly regex that can be used in a loop to build a dictionary of all the parameter names and values. It assumes two things: all parameters have names, and there is only one level of template/link nesting.

So it correctly parses

let text = `=={{int:filedesc}}==
{{Book 
| Title          = {{es|1=Cuadragésima octava memoria de la Empresa de los     Ferrocarriles del Estado de Chile, correspondiente al año 1931}}
| Volume         = 
| Publisher      = Imprenta de los Ferrocarriles del Estado
| Source         = {{Memoria Chilena|https://memoriachilena.gob.cl/602/w3-article-553566.html}}
| Permission     = [[example|1]]
| Image_Page     = 
| Wikisource     = s:es:Índice:{{PAGENAME}}
| Other_Versions = {{Image extracted|1=48a Memoria de los Ferrocarriles del Estado de Chile (1932) (page 29 crop).jpg}}
| Wikidata       = Q125002636
}}`
const rexp = /{{ *Book\s*\|\s*([\w ]*=*)((?:[^{}\[\]\|]|{{[^}]+}}|\[\[[^\]]+\]\])*)(?=\||}})/i;
const params = {};
let match;
// Peel off one "| name = value" pair per iteration until none remain.
while ((match = text.match(rexp))) {
  const paramName = match[1].replace("=", "").trim();
  const paramContent = match[2].trim();
  params[paramName] = paramContent;
  // Drop the parameter just read so the next one sits right after "{{Book".
  text = text.replace(rexp, "{{Book");
}

This code gives you a dictionary, params, that contains every parameter passed to the Book template.

It's a hell of a regex, but this is what it does:

{{ *Book\s*\|\s*     // "Book" is the template name
([\w ]*=*)           // first capture group: the parameter name (plus the "=")
(                    // start of the second capture group: the value
(?:[^{}\[\]\|]|      // non-capturing group, option 1: any character that is not a brace, bracket or pipe
{{[^}]+}}|           // option 2: a template with no other template inside
\[\[[^\]]+\]\])      // option 3: an internal link with no other link inside
*                    // zero or more times
)                    // end of the second capture group
(?=\||}})            // lookahead for the next pipe or the end of the template

Upvotes: 0

Robin Mackenzie

Reputation: 19319

You can try this below - it is matching on the variables you want to include, not those you want to exclude:

(^{{TemplateX)|\|\s*(c|e|g|h|i|l[ ]*\=[ ]*)(.*)|(}}$)

Tested here.

Edit

I enhanced it to this which I think is a bit better if you compare the two regexes using the diagram tool at regexper.com:

(^{{TemplateX)|(\|[ ]*)(c|e|g|h|i|l)([ ]*\=[ ]*)(.*)|(}}$)

Edit 2

Further to the comments, the regex to match the unwanted parameters is this:

\|[ ]?(a|b|d|f|j|k|m)([ ]*\=[ ]*)((?![\r\n]+\|)[0-9a-zA-Z, \[\]:\|\r\n\t])+

Leveraging this answer, it uses a negative lookahead to match only up to [\r\n]+\|, which in part satisfies the statement that:

So '\n|' is a safe "start" and "stop" sequence

Tested here with the introduction of a few newlines in the parameters to be retained (e.g. g).

There is a risk that you may have a parameter value with a character other than

[0-9a-zA-Z, \[\]:\|\r\n\t]

To solve that you would need to update that list.
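
A possible way to sidestep the character list altogether, sketched under the asker's stated assumption that every parameter starts on its own line with | as the first character: match lazily from the unwanted parameter's leading pipe up to the next line that begins with | or }}. The function name stripUnwanted is hypothetical, and the parameter names are hard-coded from the question.

```javascript
// Sketch only. Assumes every parameter starts at the beginning of a line with "|"
// and that the template closes with "}}" at the beginning of a line.
const UNWANTED = /^\|[ ]*(?:a|b|d|f|j|k|m)[ ]*=[\s\S]*?(?=^\||^\}\})/gm;

// Hypothetical helper: returns the wikitext with the unwanted parameters removed.
function stripUnwanted(wikitext) {
  // [\s\S]*? lazily consumes anything, including newlines and pipes inside
  // [[foo|bar]] links; the lookahead stops it at the next line-initial "|" or "}}".
  return wikitext.replace(UNWANTED, "");
}

const sample = [
  "{{TemplateX",
  "| a =",
  "Foo bar",
  "",
  "Fizbin foo[[domain:blah]]",
  "|b =1",
  "|c = 0fillertext",
  "| j = Level 4 [[domain:filk|Songs]]",
  "|l = ",
  "|m = ",
  "}}",
].join("\n");

console.log(stripUnwanted(sample));
```

Because each match includes the newline that ends the parameter, no blank lines are left behind. If an unwanted parameter is not followed by a line starting with | or }}, the pattern does not match and that parameter is left in place.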

Upvotes: 2

Tgr

Reputation: 28160

Trying to account for the full flexibility of template language is hopeless. For example, a template could look like

{{TemplateX
| a=1 | b=2 }}

or

{{TemplateX|
| a=1 <nowiki>|</nowiki> b=2 }}

which is completely different (the first one has two parameters, a and b; the second has a single parameter, a). Regular expressions (mostly) cannot keep track of context such as nesting, so they can't grasp constructs like that.

So unless you are sure the template is always used according to the same convention, you are better off using some proper parser such as mwparserfromhell:

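import mwparserfromhell

# text holds the wikitext to process
wikicode = mwparserfromhell.parse(text)
for template in wikicode.filter_templates(recursive=True, matches=lambda t: t.name.strip() == 'TemplateX'):
    for param in ['a', 'b', 'd', 'f', 'j', 'k', 'm']:
        # remove() raises ValueError when the parameter is absent, so check first
        if template.has(param):
            template.remove(param)
print(wikicode)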

(This would require rewriting your code in Python or calling out to a Python backend service. I don't think there is any good wikitext parser in JavaScript.)

Alternatively, you can use the parse API with the prop=parsetree to get an XML tree representation of the template and its arguments, which is not that hard to process.
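
A rough sketch of what such a request might look like from JavaScript. The endpoint URL is a placeholder and both function names are hypothetical; only the action=parse, prop=parsetree, contentmodel and text parameters come from the MediaWiki API itself. The exact shape of the JSON response depends on the format version requested, so inspect it before relying on it.

```javascript
// Placeholder: substitute the api.php URL of the wiki you are working with.
const API_ENDPOINT = "https://example.org/w/api.php";

// Hypothetical helper: builds a parse request that asks for the XML parse tree.
function buildParseTreeUrl(wikitext) {
  const query = new URLSearchParams({
    action: "parse",
    format: "json",
    contentmodel: "wikitext", // needed when passing raw text without a title
    prop: "parsetree",
    text: wikitext,
  });
  return `${API_ENDPOINT}?${query}`;
}

// Hypothetical helper: fetches the parse tree. For long wikitext a POST
// request is likely a better fit than putting the text in the URL.
async function fetchParseTree(wikitext) {
  const response = await fetch(buildParseTreeUrl(wikitext));
  const data = await response.json();
  // The XML lives under parse.parsetree; its exact nesting varies by format version.
  return data.parse ? data.parse.parsetree : undefined;
}
```

The returned XML wraps each template argument in its own element, so removing a parameter becomes a matter of walking the tree rather than guessing at delimiters.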

Upvotes: 0
