aphexlog

Reputation: 1775

Remove select characters from xml tags using regex

I am trying to remove only select characters from XML tags: the ns prefix, the digit that follows it, and the trailing colon. For example, <ns2:projectArea alias= should become <projectArea alias= and <ns9:name> should become <name>.

Basically, the digit is arbitrary (anything from 1 to 9) and is always followed by a : that must also be deleted.

What I have so far is:

import argparse
import re

# Initiates argument
parser = argparse.ArgumentParser()

parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")

args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
  print("inputfile location is %s" % args.input)
if args.output:
  print("outputfile location is %s" % args.output)
# End argument

with open(inputfile) as f:
    text = re.sub('<[^<]+>', "", f.read())
with open(outputfile, "w") as f:
    f.write(text)

This piece of the code is the issue: '<[^<]+>'. It deletes entire tags, so if I need to search the text later on, I basically have to search plain text rather than by tag.

What can I replace '<[^<]+>' with that will delete ns + the following number (whatever number it may be) + the : that follows it?

Upvotes: 0

Views: 232

Answers (3)

Matt.G

Reputation: 3609

Regex: (?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)

Demo
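
A minimal sketch of how this pattern might be applied with re.sub in Python. The helper name strip_ns_prefix and the sample string are illustrative assumptions, not part of the answer. The two lookbehinds cover opening and closing tags, and replacing the match with an empty string drops only the prefix while leaving the rest of the tag intact.

```python
import re

# Pattern from the answer above: an ns<digits>: prefix that sits
# directly after "<" or "</" and is followed by the rest of a tag.
NS_PREFIX = re.compile(r"(?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)")


def strip_ns_prefix(xml_text):
    """Hypothetical helper: remove ns<digits>: prefixes from tag names."""
    return NS_PREFIX.sub("", xml_text)


# Made-up sample input, for illustration only.
sample = '<ns2:projectArea alias="x"><ns9:name>demo</ns9:name></ns2:projectArea>'
print(strip_ns_prefix(sample))
```

In the question's script, this call would take the place of the re.sub('<[^<]+>', ...) line.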

Upvotes: 0

user557597

This works:

Find r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
Replace <$1$2$3>

https://regex101.com/r/yRhMI9/1

Readable version:

 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+ 
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
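
The $1$2$3 replacement is written in regex101 style; in Python the backreferences would be \1\2\3. A rough sketch under that assumption follows. The function name and sample input are hypothetical, and unmatched groups are substituted as empty strings only on Python 3.5 or later.

```python
import re

# Pattern copied from the answer: group 1 is the optional "/",
# group 2 a tag without attributes, group 3 a tag with attributes.
TAG = re.compile(
    r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))"
    r"|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
)


def remove_prefix(xml_text):
    """Hypothetical helper: rebuild each tag without its prefix."""
    # Only one alternative matches per tag, so the other groups are empty.
    return TAG.sub(r"<\1\2\3>", xml_text)


# Made-up sample input, for illustration only.
sample = '<ns2:projectArea alias="x"><ns9:name>demo</ns9:name></ns2:projectArea>'
print(remove_prefix(sample))
```

Because \w+[1-9]: accepts any word-character prefix ending in a digit, this pattern is not limited to prefixes that start with ns.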

Upvotes: 0

Venkata Gogu

Reputation: 1051

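It might be happening because of the regex: '<[^<]+>' matches each tag in full, so whole tags are removed. Try a pattern that matches only the prefix instead, in both opening and closing tags:

   text = re.sub(r'<(/?)ns[0-9]+:', r'<\1', open(inputfile).read())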

Upvotes: 1
