DevC

Reputation: 7423

Parsing SVG in Python

I have an SVG file and an HTML file, each containing a couple of JavaScript script tags. I need to find all the script tags and insert a comment before the first one and another after the last one. I am trying to do this with BeautifulSoup. It works well for the HTML version, but for the SVG it throws an error.

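 # HTML version of the file, working as expected (BeautifulSoup 3 API)
 # startcomment and endcomment are defined elsewhere (not shown)
 from BeautifulSoup import BeautifulSoup

 soup = BeautifulSoup(data, selfClosingTags=['link', 'meta'])
 scripts = soup.findAll('script')
 for num, tag in enumerate(scripts):
     if num == 0:
         soup.head.insert(-1, startcomment)
     # move each script tag to the same position in <head>
     tag.extract()
     soup.head.insert(-1, tag)
     if num == len(scripts) - 1:
         soup.head.insert(-1, endcomment)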

But when I try the same for the SVG, with soup = BeautifulSoup(data, "xml") as the first line, that line itself throws an exception. SVG is also XML, so shouldn't I be able to do it the same way?

Update - SVG format

<?xml version="1.0"?>
<?xml-stylesheet href="../../../some.css" type="text/css"?>
<svg id="mycontent" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"   xmlns:svg="http://www.w3.org/2000/svg" version="1.2" baseProfile="tiny" focusable="true" onload="Jsfunction.load()">
<script xlink:href="../first.js" />
<script xlink:href="../second.js" />
<script xlink:href="../third.js" />
</svg>

This should be changed to:

<?xml version="1.0"?>
<?xml-stylesheet href="../../../some.css" type="text/css"?>
<svg id="mycontent" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"   xmlns:svg="http://www.w3.org/2000/svg" version="1.2" baseProfile="tiny" focusable="true" onload="Jsfunction.load()">
<!-- some comment -->
<script xlink:href="../first.js" />
<script xlink:href="../second.js" />
<script xlink:href="../third.js" />
<!-- end comment -->
</svg>

Upvotes: 0

Views: 4080

Answers (1)

Martijn Pieters

Reputation: 1121942

Use BeautifulSoup version 4, not 3, and install lxml to handle XML parsing.

As of version 4.3.2, BeautifulSoup ignores processing instructions (like the <?xml-stylesheet?> instruction); see bug 1294645. You can work around this by patching the tree builder:

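from bs4.builder import LXMLTreeBuilderForXML
from bs4 import ProcessingInstruction

def handle_pi(self, target, data):
    # Flush any pending text, then store the instruction as its own node
    self.soup.endData()
    self.soup.handle_data(target + ' ' + data)
    self.soup.endData(ProcessingInstruction)

# Register the handler for the parser's processing-instruction events
LXMLTreeBuilderForXML.pi = handle_pi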

The bug has since been marked as solved, and as of BeautifulSoup 4.4 (released July 2015) you no longer need the above work-around.

You want to store the list of script tags in a variable so you can access the first and last tag without looping:

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(data, 'xml')
start_comment = soup.new_string('some comment', Comment)
end_comment = soup.new_string('end comment', Comment)

script_tags = soup.find_all('script')
script_tags[0].insert_before(start_comment)
script_tags[-1].insert_after(end_comment)
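
As a hedged sketch, the same steps can be wrapped in a small helper that also copes with a document containing no script tags, where script_tags[0] would raise an IndexError. The helper name add_script_markers and the trimmed sample document are made up for illustration and are not part of bs4; the sketch assumes bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed version of the sample SVG from the question (illustrative only).
SAMPLE_SVG = """<?xml version="1.0"?>
<svg id="mycontent" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<script xlink:href="../first.js" />
<script xlink:href="../second.js" />
<script xlink:href="../third.js" />
</svg>"""


def add_script_markers(soup, start_text, end_text):
    """Hypothetical helper: comment before the first and after the last script tag.

    Returns False and leaves the tree untouched when there are no script tags.
    """
    script_tags = soup.find_all('script')
    if not script_tags:
        return False
    script_tags[0].insert_before(soup.new_string(start_text, Comment))
    script_tags[-1].insert_after(soup.new_string(end_text, Comment))
    return True


soup = BeautifulSoup(SAMPLE_SVG, 'xml')  # the 'xml' feature requires lxml
add_script_markers(soup, 'some comment', 'end comment')
print(soup)
```

Returning a boolean rather than raising lets the caller decide whether a file without script tags is an error.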

For your sample SVG document, this results in:

>>> print soup.prettify(formatter='xml')
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="../../../some.css" type="text/css"?>
<svg:svg baseProfile="tiny" focusable="true" id="mycontent" onload="Jsfunction.load()" version="1.2" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
 <!--some comment-->
 <svg:script xlink:href="../first.js"/>
 <svg:script xlink:href="../second.js"/>
 <svg:script xlink:href="../third.js"/>
 <!--end comment-->
</svg:svg>
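
If installing lxml is not an option, the same edit can be sketched with the standard library's xml.dom.minidom. Note that this swaps in a different technique (a DOM parser, not BeautifulSoup); the function name wrap_scripts_minidom and the trimmed sample are hypothetical, and the output formatting will differ from the BeautifulSoup output above.

```python
from xml.dom import minidom

# Trimmed version of the sample SVG from the question (illustrative only).
SAMPLE_SVG = """<?xml version="1.0"?>
<?xml-stylesheet href="../../../some.css" type="text/css"?>
<svg id="mycontent" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<script xlink:href="../first.js" />
<script xlink:href="../second.js" />
<script xlink:href="../third.js" />
</svg>"""


def wrap_scripts_minidom(svg_text, start_text, end_text):
    """Hypothetical helper: return the XML with comments around the script tags."""
    doc = minidom.parseString(svg_text)
    scripts = doc.getElementsByTagName("script")
    if not scripts:
        return doc.toxml()  # nothing to wrap

    first, last = scripts[0], scripts[-1]

    # Comment plus a newline directly before the first script element.
    first.parentNode.insertBefore(doc.createComment(start_text), first)
    first.parentNode.insertBefore(doc.createTextNode("\n"), first)

    # insertBefore() with a reference node of None appends, which covers
    # the case where the last script element is its parent's final child.
    after = last.nextSibling
    last.parentNode.insertBefore(doc.createTextNode("\n"), after)
    last.parentNode.insertBefore(doc.createComment(end_text), after)
    return doc.toxml()


print(wrap_scripts_minidom(SAMPLE_SVG, " some comment ", " end comment "))
```

The processing instruction is kept because minidom stores it as a child of the document node.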

Upvotes: 2
