Erases the following tag while trying to replace an HTML tag using Python

Question

I have a Word document in the following format:

test1.docx

 ["This is article|", "", ", "Author is|", "Research is on|", "CASE IS|", "1-3|"]

I tried to locate the tag starting with and replace the tag and its contents with "":

import json
import re

def locatestag():
    fileis = open("test1.docx")

    for line in fileis:
        print(line)
        newfile = re.sub(' .*? ','',line)

    with open("new file.json","w") as newoutput:
        json.dump(newfile, newoutput)

The final output file also makes tags like disappear.

The final contents look like this:

["This is article|", "", ", ]

How can I remove and its contents only, while retaining the rest of the tags (i.e. retain tag )?

BiOS · Accepted Answer

Adjust your regex so that only the tags and their contents are matched.

for line in fileis:
    print(line)
    newfile = re.sub(']*>[^"]*(?=")','',line)

The resulting newfile will be:

["This is article|", "", ", "", "", "CASE IS|", ""]

Of course, this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but the intended content of your file.
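
As a self-contained illustration of the same idea, here is a minimal sketch. The tag names did not survive in the post, so `<sup>` and `<p>`, the sample line, and the helper name `strip_tag` are all assumptions made up for this example. The point is that `[^"]*` cannot cross a double quote and the lookahead `(?=")` leaves the closing quote in place, so a match never spills into the next string.

```python
import re

def strip_tag(line, tag="sup"):
    """Remove an opening tag and the text after it, up to the next double quote.

    `tag` is a hypothetical tag name; substitute the one you actually need.
    """
    # <tag ...>   the opening tag, with optional attributes
    # [^"]*       everything up to, but not including, the next double quote
    # (?=")       lookahead: require the quote without consuming it
    pattern = r'<' + re.escape(tag) + r'\b[^>]*>[^"]*(?=")'
    return re.sub(pattern, '', line)

# Hypothetical sample data, not the asker's real file contents.
sample = '["<p>This is article|", "<sup>Author is|", "<p>CASE IS|", "<sup>1-3|"]'
print(strip_tag(sample))
# prints: ["<p>This is article|", "", "<p>CASE IS|", ""]
```

Strings that start with a different tag are left untouched, because the pattern is anchored to one specific opening tag rather than to a lazy `.*?` that can run on into the next tag.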

Answers (2)

Here is the final code for you
