Reputation: 507
I have a malformed XML file that contains extra quotes in a tag. I would like to remove them or replace them with &quot;. The malformed XML looks like:
<CLASS ATT2="PDX"R"088">
My expected result:
<CLASS ATT2="PDX R 088">
or
<CLASS ATT2="PDX&quot;R&quot;088">
I've tried iterating through all the lines and finding the first and last quote indexes of ATT, but it's quite messy and takes too much code.
Does anyone have a simple solution for this?
Upvotes: 0
Views: 215
Reputation: 3578
Not the best solution maybe, but since you cannot parse it with (e.g.) xml.etree as it is invalid, you can try playing with something like the code below.
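It will:
- search each line for CLASS
- if CLASS is found, find all the occurrences of double quotes (")
- if there are more than two, replace every quote except the first and the last with a space

WARNING: BACKUP YOUR ORIGINAL FILE AS THIS WILL MODIFY IT!!!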
import re

# Text mode ("r+"), so each line is a str and the "CLASS" test works on Python 3
with open(r'YOUR/FILE/HERE', "r+") as f:
    lines = f.readlines()
    for idx, row in enumerate(lines):
        if "CLASS" in row:
            # Positions of every double quote on this line
            quote_index = [m.start() for m in re.finditer('"', row)]
            if len(quote_index) > 2:
                # Keep the first and last quote, blank out the ones in between
                correct_row = list(row)
                for pos in quote_index[1:-1]:
                    correct_row[pos] = " "
                lines[idx] = "".join(correct_row)
    # Overwrite the file in place with the corrected lines
    f.seek(0)
    f.truncate()
    f.write("".join(lines))
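To try the same idea without touching a file, the per-line logic can be pulled out into a function and run on the sample line from the question. This is only a sketch: the name blank_inner_quotes is made up for illustration, and it assumes a single attribute per line, just like the code above.

```python
import re


def blank_inner_quotes(line: str) -> str:
    """Replace every double quote except the first and last with a space.

    Hypothetical helper; assumes the line holds a single attribute.
    """
    positions = [m.start() for m in re.finditer('"', line)]
    if len(positions) <= 2:
        # Nothing extra to fix on this line
        return line
    chars = list(line)
    for pos in positions[1:-1]:
        chars[pos] = " "
    return "".join(chars)


sample = '<CLASS ATT2="PDX"R"088">'
print(blank_inner_quotes(sample))  # <CLASS ATT2="PDX R 088">
```

Working on strings first makes it easy to check the result before overwriting the real file.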
Upvotes: 0
Reputation: 336148
This is not 100% foolproof, but might work with a little luck:
re.sub(r'(?<!=)"(?!>)', '&quot;', malformed_xml)
will only replace quotes that are neither preceded by = nor followed by >.
If there could be whitespace after = (or before >), you can't use the re module anymore, but the regex module (available on PyPI) can work with this:
regex.sub(r'(?<!=\s*)"(?!\s*>)', '&quot;', malformed_xml)
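As a quick check, the standard-library pattern can be run against the sample line from the question. This is a minimal sketch; the function name escape_inner_quotes is made up for illustration.

```python
import re


def escape_inner_quotes(line: str) -> str:
    """Escape quotes that are neither preceded by '=' nor followed by '>'.

    Hypothetical helper wrapping the re.sub call from the answer.
    """
    return re.sub(r'(?<!=)"(?!>)', '&quot;', line)


sample = '<CLASS ATT2="PDX"R"088">'
print(escape_inner_quotes(sample))  # <CLASS ATT2="PDX&quot;R&quot;088">
```

This is where the "not 100% foolproof" part shows: if a tag has more than one attribute, or is self-closing with "/>", the closing quote of an attribute is not followed directly by > and gets escaped as well.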
Upvotes: 1