Reputation: 168

Regular expression needed to remove custom markup tags

I have not written regular expressions before and my knowledge is woefully inadequate. I am hoping the experts here would be able to help me with a regular expression that I could use in C# to remove the markup tags only.

The markup has one of the following opening tags: <AI>! or <AH>! or <AG>! and is ended with another !

Example: the quick brown <AI>!fox jumps! over the lazy dog!

After the markup is removed, it should be: the quick brown fox jumps over the lazy dog!

Code Snippet:

    NOT MORE THAN 85 % OF H<AH>!3!BO<AH>!3! CALCULATED ON THE DRY WEIGHT
    - Uranium ores and pitchblende, and concentrates thereof, with a uranium content of more than 5 % by weight (<AI>!Euratom!)
    - Monazite; urano-thorianite and other thorium ores and concentrates, with a thorium content of more than 20 % by weight (<AI>!Euratom!)
    - - - -   -94% or more, but not more than 98.5% of a-Al<AH>!2!O<AH>!3!  -2% (+/-1.5%) of magnesium spinel,  -1% (+/-0.6%) of yttrium oxide and   -2% (+/-1.2%) of each lanthanum oxide and neodymium oxide  with less than 50% of the total weight having a particle size of more than 10mm
    - Activated alumina with a specific surface area of at least 350 m<AG>!2!g
    IRON OXIDES AND HYDROXIDES; EARTH COLOURS CONTAINING 70 % OR MORE BY WEIGHT OF COMBINED IRON EVALUATED AS FE<AH>!2!O<AH>!3!:
    - <AI>!o!-Xylene
    - <AI>!m!-Xylene
    - <AI>!p!-Xylene
    - - - 1,6,7,8,9,14,15,16,17,18,18-Dodecachloropentcyclo[12.2.1.1<AG>!6,9!.0<AG>!2,13!.0<AG>!5,10!]octadeca-7,15-diene, (CAS RN 13560-89-9)
    - Chlorobenzene, <AI>!o!-dichlorobenzene and <AI>!p!-dichlorobenzene
    - - - Di- or tetrachlorotricyclo[8.2.2.2<AG>!4,7!]xadeca-1(12),4,6,10,13,15-hexaene, mixed isomers
    - Butan-1-ol (<AI>!n!-butyl alcohol)
    - - 2-Methylpropan-2-ol (<AI>!tert!-butyl alcohol)
    - <AI>!n!-Butyl acetate
    - <AI>!O!-Acetylsalicylic acid, its salts and esters
    - - <AI>!O!-Acetylsalicylic acid (CAS RN 50-78-2)
    - 1-Naphthylamine (<AH>!alpha!-naphthylamine), 2-naphthylamine (<AI>!beta!-naphthylamine) and their derivatives; salts thereof
    - <AI>!o!-, <AG>!m!-, <AH>!p!-Phenylenediamine, diaminotoluenes, and their derivatives; salts thereof:
    - - <AI>!o!-, <AI>!m!-, <AI>!p!-Phenylenediamine, diaminotoluenes and their halogenated, sulphonated, nitrated and nitrosated derivatives; salts thereof:
    - - Indole, 3-methylindole (skatole), 6-allyl-6,7-dihydro-5<AI>!H!-dibenz[<AI>!c,e!] azepinne (azapetine), phenindamine (INN) and their salts; imipramine hydrochloride (INNM)
    - Vitamin B<AH>!1! and its derivatives
    - Vitamin B<AH>!2! and its derivatives

Thank you in advance

Upvotes: 2

Answers (3)

Rob Rodi

Reputation: 3494

The pattern first matches an opening tag: <A, then one of G, H or I ([GHI]), then >!. After that it does a lazy match (the ? after the +) of one or more (+) of any character (.), followed by an exclamation mark. Because the match is lazy, it stops at the first closing exclamation mark instead of running on to the last exclamation mark in the sample. The parentheses in the pattern form a capture group that stores the text contained within your tags, and $1 in the replacement refers to that first group, so each tagged span is replaced by its content.

    // Requires: using System.Text.RegularExpressions;
    var r = new Regex("<A[GHI]>!(.+?)!");
    var actual = r.Replace(xml, "$1"); // xml is the string holding the marked-up text
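
To make the lazy versus greedy point concrete, here is a minimal sketch that runs the pattern on the example from the question. The class and method names (LazyMatchDemo, StripTags, StripTagsGreedy) are made up for illustration, and the greedy variant is only there for contrast; it is not something you would want to use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Pattern from the answer: opening tag, "!", lazy capture, closing "!".
    private static readonly Regex LazyTag = new Regex("<A[GHI]>!(.+?)!");

    // Same pattern without the "?", so the capture is greedy (for contrast only).
    private static readonly Regex GreedyTag = new Regex("<A[GHI]>!(.+)!");

    // Hypothetical helper: replaces each tagged span with its captured content.
    public static string StripTags(string text)
    {
        return LazyTag.Replace(text, "$1");
    }

    // Hypothetical helper: shows what goes wrong when the capture is greedy.
    public static string StripTagsGreedy(string text)
    {
        return GreedyTag.Replace(text, "$1");
    }

    public static void Main()
    {
        string sample = "the quick brown <AI>!fox jumps! over the lazy dog!";

        // Lazy: stops at the first "!" after the tag.
        Console.WriteLine(StripTags(sample));
        // prints: the quick brown fox jumps over the lazy dog!

        // Greedy: runs on to the last "!" in the line and swallows it.
        Console.WriteLine(StripTagsGreedy(sample));
        // prints: the quick brown fox jumps! over the lazy dog

        // Several tags on one line are each handled separately.
        Console.WriteLine(StripTags("H<AH>!3!BO<AH>!3! CALCULATED ON THE DRY WEIGHT"));
        // prints: H3BO3 CALCULATED ON THE DRY WEIGHT
    }
}
```

One thing to keep in mind: (.+?) needs at least one character between the exclamation marks and . does not match a newline by default, so empty or multi-line tag content would need a different capture.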

Upvotes: 5

scibuff

Reputation: 13765

    using System;
    using System.Text.RegularExpressions;

    public class Example
    {
        public static void Main()
        {
            string pattern = @"\<A(G|H|I)\>\!([^\!]*)\!";
            string input = "<AI>!n!-Butyl acetate the quick brown "
                + "<AI>!fox jumps! over the lazy dog!";
            string replacement = "$2";
            Regex rgx = new Regex(pattern);
            string result = rgx.Replace(input, replacement);

            Console.WriteLine("Original String:    '{0}'", input);
            Console.WriteLine("Replacement String: '{0}'", result);
        }
    }

    Original String:    '<AI>!n!-Butyl acetate the quick brown <AI>!fox jumps! over the lazy dog!'
    Replacement String: 'n-Butyl acetate the quick brown fox jumps over the lazy dog!'

http://ideone.com/z0fbL

Upvotes: 0

Roy Dictus

Reputation: 33149

The regex to use would have to be something like this:

    \<..\>!([^!]*)!

It matches a <, any two characters, a >, a !, then a series of characters that are not !, and finally a closing !.

You then replace each match (the whole text that matches the expression above) with the captured group (that is, the text matched between the parentheses).
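
A rough sketch of how that replacement might look in C#, next to a stricter variant that only accepts the three tags from the question. The names TagPatternDemo, StripLoose and StripStrict are hypothetical, and the <XY> tag in the sample line is invented just to show the difference; it does not appear in the question's data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Loose form: any two characters between < and > count as a tag.
    public static string StripLoose(string text)
    {
        return Regex.Replace(text, @"<..>!([^!]*)!", "$1");
    }

    // Stricter form (an assumption, not part of this answer): only AG, AH, AI.
    public static string StripStrict(string text)
    {
        return Regex.Replace(text, @"<A[GHI]>!([^!]*)!", "$1");
    }

    public static void Main()
    {
        // <XY> is a made-up tag used only to show where the two forms differ.
        string sample = "Vitamin B<AH>!1! and <XY>!other! text";

        Console.WriteLine(StripLoose(sample));
        // prints: Vitamin B1 and other text

        Console.WriteLine(StripStrict(sample));
        // prints: Vitamin B1 and <XY>!other! text
    }
}
```

If the data only ever contains the three tags listed in the question, both forms give the same result; the stricter one just avoids touching anything else that happens to look like a two-character tag.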

Upvotes: 0
