Reputation: 41
I have a word doc of following format
test1.docx
["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]
Tried to locate tag starting with <s.*?> and replace the tag and its contents with ""
def locatestag():
fileis = open("test1.docx")
for line in fileis:
print(line)
newfile = re.sub('<s.*?> .*? ','',line)
with open("new file.json","w") as newoutput:
son.dump(newfile, newoutput)
The final output file also makes tag like disapper.
The final contents is like
["<h58>This is article|", "", ", ]
How to remove <s.*> and its contents only while retaining rest of the tag (ie retain tag )
Upvotes: 1
Views: 45
Reputation: 2304
Adjust your regex so that only the <s.*>
tags and their contents are matched.
for line in fileis:
print(line)
newfile = re.sub('<s[^>]*>[^"]*(?=")','',line)
The resulting newfile
will be:
["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]
Of course this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but intentionally the content of your file.
Upvotes: 1
Reputation:
You just want to remove the tags, and not everything after the tags so there is no need to add the extra .*?
re.sub('<s.*?>','',line)
Upvotes: 1