MrNetroful

Reputation: 507

Escape extra quotes in malformed xml

I have a malformed XML file that contains extra quotes inside an attribute value. I would like to remove them or replace them with &quot;. The malformed XML looks like this:

<CLASS ATT2="PDX"R"088">

My expected result:

<CLASS ATT2="PDX R 088">
or
<CLASS ATT2="PDX&quot;R&quot;088">

I've tried iterating through all the lines and finding the first and last quote index of each ATT attribute, but that is quite messy and takes too much code.

Does anyone have a simple solution for this?

Upvotes: 0

Views: 215

Answers (2)

umbe1987

Reputation: 3578

Maybe not the best solution, but since the file is invalid and cannot be parsed with a regular XML parser (e.g. xml.etree), you can try playing with something like the code below.

It will:

  1. open the file
  2. read it line by line
  3. check whether each line contains a specific string (e.g. CLASS)
  4. if CLASS is found, find all the occurrences of double quotes (")
  5. if more than two double quotes are found, replace every quote between the first and the last with a space
  6. update the lines

WARNING: BACKUP YOUR ORIGINAL FILE AS THIS WILL MODIFY IT!!!

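import re

# Open in text mode ("r+") so each row is a str; with "r+b", Python 3
# yields bytes and the string tests below would raise a TypeError.
with open(r'YOUR/FILE/HERE', "r+") as f:
    lines = f.readlines()
    for idx, row in enumerate(lines):
        if "CLASS" in row:
            # positions of every double quote on this line
            quote_index = [x.start() for x in re.finditer('"', row)]
            if len(quote_index) > 2:
                # keep the first and last quote, blank out the ones in between
                correct_row = list(row)
                for pos in quote_index[1:-1]:
                    correct_row[pos] = " "
                lines[idx] = "".join(correct_row)
    # rewrite the file in place with the corrected lines
    f.seek(0)
    f.truncate()
    f.write(''.join(lines))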
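To try the per-line logic without touching a file, the same idea can be sketched as a small function. The name blank_inner_quotes and the marker parameter are made up for this illustration; the sample line is the one from the question.

```python
import re

def blank_inner_quotes(row, marker="CLASS"):
    """Replace every double quote between the first and last one with a space.

    Hypothetical helper: rows that do not contain `marker`, or that have
    at most two double quotes, are returned unchanged.
    """
    if marker not in row:
        return row
    positions = [m.start() for m in re.finditer('"', row)]
    if len(positions) <= 2:
        return row
    chars = list(row)
    for pos in positions[1:-1]:
        chars[pos] = " "
    return "".join(chars)

print(blank_inner_quotes('<CLASS ATT2="PDX"R"088">'))
# <CLASS ATT2="PDX R 088">
```

Note that this treats everything between the first and last quote of the line as one value, so it is only safe when the tag has a single attribute per line; with several attributes, the legitimate quotes in the middle would be blanked as well.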

Upvotes: 0

Tim Pietzcker

Reputation: 336148

This is not 100% foolproof, but might work with a little luck:

re.sub(r'(?<!=)"(?!>)', '&quot;', malformed_xml)

will only replace quotes that are neither preceded by = nor followed by >.
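As a quick check, here is that substitution applied to the line from the question. The wrapper name escape_stray_quotes is just for this sketch.

```python
import re

def escape_stray_quotes(malformed_xml):
    # Escape every double quote that is neither directly preceded by "="
    # nor directly followed by ">".
    return re.sub(r'(?<!=)"(?!>)', '&quot;', malformed_xml)

print(escape_stray_quotes('<CLASS ATT2="PDX"R"088">'))
# <CLASS ATT2="PDX&quot;R&quot;088">
```

One case where the luck runs out: in a tag with several attributes, the closing quote of every attribute except the last is followed by a space rather than >, so it gets escaped too. The same happens to the closing quote before /> in a self-closing tag.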

If there could be whitespace after = (or before >), you can't use the re module anymore, because it only supports fixed-width lookbehind, but the regex module (PyPI) can work with this:

regex.sub(r'(?<!=\s*)"(?!\s*>)', '&quot;', malformed_xml)

Upvotes: 1
