Robur_131

Reputation: 694

Removing tags from a field in an XML file

I have an XML file that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out from the first half and are now sailing across the second half.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wouldn't it have been quicker to set sail in the opposite direction from where they started?     &lt;/p&gt;&#xA;" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="&lt;one-piece&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai's power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="&lt;yu-yu-hakusho&gt;" AnswerCount="1" CommentCount="0" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="&lt;p&gt;In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What's the significance of the watermelon and why does she carry one around?&lt;/p&gt;&#xA;" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="&lt;sora-no-otoshimono&gt;" AnswerCount="2" CommentCount="1" />
</posts>

The file contains many lines, each holding a single row element. I want to extract the Body attribute of each row. For example, the Body attribute for Id = 2 is:

"&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai's power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;"

I've already extracted the Body attributes using ElementTree. Next, I want to process the words inside the Body of each row, so I need to strip out any HTML tags first. For example, after stripping the HTML tags, the Body of Id = 2 should look like this:

In the middle of The Dark Tournament Yusuke Urameshi gets to fully inherit Genkai's power of the .... (continued)

What I've tried so far:

import bs4

def remove_html_tags(text):
    return bs4.BeautifulSoup(text, "html.parser").text

This results in:

pin the middle of emthe dark tournamentem yusuke urameshi gets to fully inherit genkais power of the emspirit waveem by absorbing a ball of energy from herp
phowever this process turns into an excruciating trial for yusuke almost killing him and keeping him doubled over in extreme pain for a long period of time so much so that his spirit animal poo is also in pain and flies to him to try to helpp
pmy question is why is it such a painful procedure to learn and absorb this powerp

As you can see, the angle brackets are gone, but the tag names themselves (p, em) are still in the text. How can I remove them?

Upvotes: 1

Views: 70

Answers (2)

dabingsou

Reputation: 2469

Here is another method, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc
xml = '''<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out from the first half and are now sailing across the second half.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wouldn't it have been quicker to set sail in the opposite direction from where they started?     &lt;/p&gt;&#xA;" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="&lt;one-piece&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai's power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="&lt;yu-yu-hakusho&gt;" AnswerCount="1" CommentCount="0" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="&lt;p&gt;In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What's the significance of the watermelon and why does she carry one around?&lt;/p&gt;&#xA;" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="&lt;sora-no-otoshimono&gt;" AnswerCount="2" CommentCount="1" />
'''
doc = SimplifiedDoc(xml)
rows = doc.selects('row>Body()')
print([doc.removeHtml(doc.unescape(row)) for row in rows])

Result:

['Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line. The Straw Hats started out from the first half and are now sailing across the second half. Wouldn', 'In the middle of The Dark Tournament, Yusuke Urameshi gets to fully inherit Genkai', 'In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons. What']
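If adding a third-party dependency is not an option, roughly the same idea can be sketched with the standard library alone. This is only an illustration under some assumptions: the sample XML is a shortened, made-up row in the same shape as the question's file, and `TextExtractor`, `strip_tags` and `body_texts` are hypothetical helper names. ElementTree decodes the `&lt;`/`&gt;` entities when it reads the attribute, so the Body value already contains real `<p>` and `<em>` tags by the time it reaches the HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened, hypothetical sample in the same shape as the question's file.
XML = '''<posts>
  <row Id="2" Body="&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke inherits Genkai's power.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Why is it so painful?&lt;/p&gt;&#xA;" />
</posts>'''


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text of an HTML fragment with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def body_texts(xml_source):
    """Map each row's Id to its tag-free Body text."""
    root = ET.fromstring(xml_source)
    return {row.get("Id"): strip_tags(row.get("Body", ""))
            for row in root.iter("row")}


texts = body_texts(XML)
print(texts["2"])
```

For a large dump, `ET.iterparse` could replace `ET.fromstring` so the whole file is not held in memory at once; the tag stripping itself would stay the same.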

Upvotes: 1

MendelG

Reputation: 20088

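Try this: parse the file with BeautifulSoup, then strip the tags from each row's body attribute with a regular expression.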

import re
from bs4 import BeautifulSoup

xml = """
<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out from the first half and are now sailing across the second half.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wouldn't it have been quicker to set sail in the opposite direction from where they started?     &lt;/p&gt;&#xA;" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="&lt;one-piece&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai's power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="&lt;yu-yu-hakusho&gt;" AnswerCount="1" CommentCount="0" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="&lt;p&gt;In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What's the significance of the watermelon and why does she carry one around?&lt;/p&gt;&#xA;" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="&lt;sora-no-otoshimono&gt;" AnswerCount="2" CommentCount="1" />
  """
    
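soup = BeautifulSoup(xml, "html.parser")
for tag in soup.select("posts row"):
    # html.parser lowercases attribute names, so "Body" is read as "body".
    # The attribute value is already entity-decoded, so it holds literal tags.
    result = re.sub(r"<.*?>", "", tag["body"])
    print(result.strip())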

Output:

Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.

The Straw Hats started out from the first half and are now sailing across the second half.

Wouldn't it have been quicker to set sail in the opposite direction from where they started?
In the middle of The Dark Tournament, Yusuke Urameshi gets to fully inherit Genkai's power of the Spirit Wave by absorbing a ball of energy from her.

However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.

My question is, why is it such a painful procedure to learn and absorb this power?
In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.

What's the significance of the watermelon and why does she carry one around?
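As a side note, the output shown in the question is lowercased and has no punctuation, which suggests (this is a guess, since the preprocessing code is not shown) that the text was normalized before the tags were stripped. Once `<` and `>` are removed as punctuation, nothing is left for a tag stripper to recognize, and `p` and `em` stay behind as ordinary letters. A small sketch of that ordering problem, where `normalize` is a hypothetical stand-in for whatever cleanup the asker applied:

```python
import re
import string

body = "<p>In the middle of <em>The Dark Tournament</em>, Yusuke inherits Genkai's power.</p>"


def strip_tags(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def normalize(text):
    """Hypothetical cleanup step: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Normalizing first destroys the angle brackets, so the tag names survive.
wrong = strip_tags(normalize(body))
# Stripping the tags first gives the intended text.
right = normalize(strip_tags(body))

print(wrong)
print(right)
```

If that is what happened, moving the tag removal ahead of any lowercasing or punctuation stripping should be enough to fix it.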

Upvotes: 1
