Amelia
Amelia

Reputation: 41

Erases the following tag while trying to replace a html tag using Python

I have a word doc of following format

test1.docx

 ["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]

Tried to locate tag starting with <s.*?> and replace the tag and its contents with ""

def locatestag():
 fileis = open("test1.docx")

 for line in fileis:
   print(line)
   newfile = re.sub('<s.*?> .*? ','',line)

with open("new file.json","w") as newoutput:
   son.dump(newfile, newoutput)

The final output file also makes tag like disapper.

The final contents is like

["<h58>This is article|", "", ", ]

How to remove <s.*> and its contents only while retaining rest of the tag (ie retain tag )

Upvotes: 1

Views: 45

Answers (2)

BiOS
BiOS

Reputation: 2304

Adjust your regex so that only the <s.*> tags and their contents are matched.

for line in fileis:
    print(line)
    newfile = re.sub('<s[^>]*>[^"]*(?=")','',line)

The resulting newfile will be:

["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]

Of course this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but intentionally the content of your file.

Upvotes: 1

user14185615
user14185615

Reputation:

You just want to remove the tags, and not everything after the tags so there is no need to add the extra .*?

Here is the final code for you

re.sub('<s.*?>','',line)

Upvotes: 1

Related Questions