Reputation: 126
I'm writing a script to tidy up MediaWiki files prior to conversion to Confluence markup. In this particular scenario I need to fix page links, which in MediaWiki look something like this:
[[this is a page]]
The problem is that the actual page link would be this_is_a_page. The Universal Wiki Converter isn't smart enough to realise this when it converts to Confluence markup, so you end up with broken links.
I've been trying to create a regex as part of my Python script (I've already stripped out HTML and some other tags like <gallery> etc.). The following regex selects all the links in question:
'\[\[(.*?)\]\]'
I just can't find a programmatic way to select only the spaces inside the [[ ]] so I can substitute them with underscores. I've attempted using matches with no success.
Upvotes: 2
Views: 1434
Reputation: 5193
Try re.sub with a lambda expression as the replacement:
>>> import re
>>> test = '[[this is a page]] bla bla [[this is another page]]'
>>> re.sub(r'\[\[.+?\]\]', lambda x:x.group().replace(" ","_"), test)
'[[this_is_a_page]] bla bla [[this_is_another_page]]'
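For use inside a larger clean-up script, the same idea can be wrapped in a small function. This is only a sketch: the names LINK_RE and underscore_links are made up for illustration, and it assumes links are plain [[page name]] with no nesting.

```python
import re

# Matches one [[...]] link and captures the text between the brackets.
# LINK_RE and underscore_links are illustrative names, not part of any library.
LINK_RE = re.compile(r'\[\[(.+?)\]\]')

def underscore_links(text):
    """Replace spaces with underscores inside every [[...]] link in text."""
    def fix(match):
        # match.group(1) is the link text without the surrounding brackets.
        return '[[' + match.group(1).replace(' ', '_') + ']]'
    return LINK_RE.sub(fix, text)

test = '[[this is a page]] bla bla [[this is another page]]'
result = underscore_links(test)
print(result)
```

Text outside the brackets is left untouched, because only the matched links are passed to the replacement function.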
Upvotes: 3
Reputation: 174706
Try the regex below and replace the matched spaces with underscores.
\s(?=[^\[\]]*]])
>>> import re
>>> s = " [[this is a page]] goo hghg"
>>> m = re.sub(r'\s(?=[^\[\]]*]])', "_", s)
>>> m
' [[this_is_a_page]] goo hghg'
\s(?=[^\[\]]*]]) matches a whitespace character only if it is followed by zero or more characters that are not [ or ], and then by the two closing ]] brackets.
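A quick way to see what the lookahead does and does not check is to run it on a few inputs. This is a sketch only: underscore_by_lookahead and LOOKAHEAD_RE are illustrative names, and the extra inputs are made up to show edge cases.

```python
import re

# Same pattern as above, compiled once. The names here are illustrative only.
LOOKAHEAD_RE = re.compile(r'\s(?=[^\[\]]*]])')

def underscore_by_lookahead(text):
    """Replace whitespace that is followed by non-bracket text and then ]]."""
    return LOOKAHEAD_RE.sub('_', text)

# The sample from the answer: only spaces inside the link change.
print(underscore_by_lookahead(" [[this is a page]] goo hghg"))

# \s matches any whitespace, so a tab inside a link is replaced too.
print(underscore_by_lookahead("[[tab\tseparated]]"))

# The pattern never looks for the opening [[, so a stray ]] also triggers it.
print(underscore_by_lookahead("stray text]]"))
```

If the input can contain unmatched ]] or whitespace other than spaces that should be kept, matching the whole [[...]] link and replacing inside it may be the safer choice.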
Upvotes: 3