Reputation: 23
I just wrote a Flex app which handles some Wikipedia text content as strings. I'm trying to use RegExp to clean all the Wikipedia markup. Here is an example:
I'd like this:
var pageText:String = new String("was an [[People of the United States|American]] [[film director]], writer, [[Film producer|producer]], and [[photographer]] who lived in England during most of the last four decades of his career. Kubrick was noted for the scrupulous care with which he chose his subjects, his slow method of working, the variety of genres he worked in, his technical perfectionism, and his reclusiveness about his films and personal life. He maintained almost complete artistic control, making movies according to his own whims and time constraints, but with the rare advantage of big-[[Movie studio|studio]] [[financial support]] for all his endeavors.");
to look like this:
var pageText:String = new String("was an American film director, writer, producer, and photographer who lived in England during most of the last four decades of his career. Kubrick was noted for the scrupulous care with which he chose his subjects, his slow method of working, the variety of genres he worked in, his technical perfectionism, and his reclusiveness about his films and personal life. He maintained almost complete artistic control, making movies according to his own whims and time constraints, but with the rare advantage of big-studio financial support for all his endeavors.");
So I need to write a RegExp which [[ Remove this part | but keep this one ]].
I tested these ones among others:
var pattern:RegExp = new RegExp(/\[\[(.+)\|/);
var pattern2:RegExp = new RegExp(/^\[\[\|/);
var pattern3:RegExp = new RegExp(/^\[\[[A-Z].*\|$/);
var pageTextCleaned:String = pageText.replace(pattern, " ");
Then it would be easy to just remove the remaining [[ and ]]
I'm not at all used to this RegExp stuff, so any help would be great!
Thanks!
Upvotes: 2
Views: 630
Reputation: 46513
Since I'm not sure whether a link can contain more than two pipe-separated entries, here's a solution that loops through, replacing each entry that ends in "|" (together with its leading "[[") with "[[" until none are left, then removes the remaining "[[" and "]]". If there are always only two entries, you can simplify a little to speed it up:
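// "[[" + link target (may contain spaces, so \w+ is not enough) + "|"
var entryPattern:RegExp = /\[\[[^\[\]|]*\|/;
// a literal "[[" or "]]" (alternation, not a character class, so lone "|" is kept)
var bracketPattern:RegExp = /\[\[|\]\]/;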
var pageText:String = "your text";
var replacedText:String = "";
while( pageText != replacedText ) {
if( replacedText != "" ){ pageText = replacedText; }
replacedText = pageText.replace(entryPattern, "[[");
}
replacedText = "";
while( pageText != replacedText ) {
if( replacedText != "" ){ pageText = replacedText; }
replacedText = pageText.replace(bracketPattern, "");
}
You'll probably want to drop the replace loop into your own utility "replaceAll" function, as it comes in handy everywhere.
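As a rough sketch of that utility idea, here is a JavaScript version (AS3's String.replace with regex literals should behave similarly, but that is an assumption, not something tested in Flex). The names replaceUntilStable and stripWikiLinks are hypothetical, and the two patterns assume a link target never contains "[", "]" or "|":

```javascript
// Hypothetical helper: apply a non-global replace repeatedly until the
// text stops changing. Only safe when the replacement cannot itself
// produce a new match; otherwise this would loop forever.
function replaceUntilStable(text, pattern, replacement) {
  let previous;
  do {
    previous = text;
    text = text.replace(pattern, replacement);
  } while (text !== previous);
  return text;
}

// "[[" + link target (may contain spaces) + "|"
const entryPattern = /\[\[[^\[\]|]*\|/;
// a literal "[[" or "]]"
const bracketPattern = /\[\[|\]\]/;

// Drop every "target|" prefix first, then strip the leftover brackets.
function stripWikiLinks(text) {
  const withoutTargets = replaceUntilStable(text, entryPattern, "[[");
  return replaceUntilStable(withoutTargets, bracketPattern, "");
}

console.log(
  stripWikiLinks("an [[People of the United States|American]] [[film director]]")
);
```

With a regex that carries the g flag, a single replace call does the same job as each loop, so the helper is mainly useful where a global flag is not available.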
Upvotes: 0
Reputation: 138017
I don't know about AS3, but here's some JavaScript code to achieve that, which should be similar:
s = s.replace(/\[\[(?:([^\]|]*)|[^\]|]*\|([^\]]*))\]\]/g, '$1$2');
The regex is pretty confusing. Here's a breakdown of its pieces:
- \[\[ - two opening square brackets.
- (?: | ) - non-capturing group with two options:
  - ([^\]|]*) - content that does not contain the pipe character; capture the entire content into the first group, $1.
  - [^\]|]*\|([^\]]*) - link with the pipe character:
    - [^\]|]* - some characters that are not ] or |.
    - \| - literal pipe sign.
    - ([^\]]*) - some more non-] characters, captured into the second group, $2.
- \]\] - two closing square brackets.
We then replace each match with $1$2 - one of the two groups is always empty, and the other is the string we want to keep.
Working example: http://jsbin.com/adedu4
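For a quick check without the jsbin page, here is a small self-contained sketch that runs the same pattern over a shortened version of the sample text from the question. The function name stripLinks and the constant names are made up for this illustration:

```javascript
// Same pattern as above: either [[text]] (captured as $1)
// or [[target|text]] (text captured as $2).
const wikiLink = /\[\[(?:([^\]|]*)|[^\]|]*\|([^\]]*))\]\]/g;

// In JavaScript, a group that did not take part in the match is
// substituted as an empty string, so "$1$2" yields whichever one matched.
function stripLinks(s) {
  return s.replace(wikiLink, "$1$2");
}

const sample =
  "was an [[People of the United States|American]] [[film director]], " +
  "writer, [[Film producer|producer]]";

console.log(stripLinks(sample));
// → was an American film director, writer, producer
```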
Upvotes: 0
Reputation: 2434
You are using the RegExp constructor, which takes a string as its argument, but feeding it a RegExp literal. I don't think that works as you want.
See if it works with a RegExp literal passed directly to replace:
var pageTextCleaned:String = pageText.replace(/\[\[([^\]]*\|)?([^\]]+)]]/g, "$2");
This isn't robust if you've got single ]s or multiple |s inside the [[...]]s, but it's a start.
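To make those caveats concrete, here is a small JavaScript sketch of how the same pattern treats the edge cases (the pattern syntax is shared with AS3, though the behavior there is assumed rather than tested). The function name clean and the test strings are invented for illustration:

```javascript
// The pattern from this answer: optional "target|" prefix, then the text to keep.
const linkPattern = /\[\[([^\]]*\|)?([^\]]+)]]/g;

function clean(text) {
  return text.replace(linkPattern, "$2");
}

// Normal cases: piped and plain links.
console.log(clean("an [[People of the United States|American]] [[film director]]"));
// → an American film director

// Multiple pipes: the greedy prefix swallows everything up to the last "|",
// so only the final segment survives.
console.log(clean("[[A|B|C]]"));
// → C

// A single "]" inside the link: the pattern does not match at all,
// so the markup is left in place.
console.log(clean("[[a]b]]"));
// → [[a]b]]
```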
Upvotes: 4