Ecks

Reputation: 57

How to use Python to parse an SVG document from a URL (get points of a polyline)

I'm looking for a Python library to parse the "points" values from the <polyline> elements of an SVG and print them. Ideally it would parse the SVG straight from the URL, but I could also save the SVG and do it locally.

I just need it to parse the points values and print them separately for each polyline element, so it prints something like this for the points value of each <polyline> element:

[[239,274],[239,274],[239,274],[239,275],[239,275],[238,276],[238,276],[237,276],[237,276],[236,276],[236,276],[236,277],[236,277],[235,277],[235,277],[234,278],[234,278],[233,279],[233,279],[232,280],[232,280],[231,280],[231,280],[230,280],[230,280],[230,280],[229,280],[229,280]]

After the first polyline element is parsed and printed, it would parse the next polyline element, get its points value, and print it just like the first one, until there are no more polylines to print.

The SVG's URL: http://colorillo.com/bx0l.inline.svg

Here is an example of a polyline element from the SVG:

<polyline points="239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280" style="fill: none; stroke: #000000; stroke-width: 1; stroke-linejoin: round; stroke-linecap: round; stroke-antialiasing: false; stroke-antialias: 0; opacity: 0.8"/>

I'm just looking for some quick help and an example. If you're able to help me out, that would be neat.

Upvotes: 0

Views: 1491

Answers (2)

balderman

Reputation: 23815

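The code below downloads the SVG, strips the default namespace so that plain tag names can be searched, collects the points of every polyline, and prints one list per polyline: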

import xml.etree.ElementTree as ET
from collections import namedtuple
import requests
import re

Point = namedtuple('Point', 'x y')

all_points = []
r = requests.get('http://colorillo.com/bx0l.inline.svg')
if r.status_code == 200:
    # Remove the default namespace so findall can use the bare tag name
    data = re.sub(' xmlns="[^"]+"', '', r.content.decode('utf-8'), count=1)
    root = ET.fromstring(data)
    poly_lines = root.findall('.//polyline')
    for poly_line in poly_lines:
        tmp = []
        # split() with no argument tolerates repeated whitespace between pairs
        _points = poly_line.attrib['points'].split()
        for _p in _points:
            tmp.append(Point(*[int(z) for z in _p.split(',')]))
        all_points.append(tmp)

# Print one [[x,y],[x,y],...] line per polyline
for points in all_points:
    tmp = [str([p.x, p.y]).replace(' ','') for p in points]
    line = ','.join(tmp)
    print('[' + line + ']')
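An alternative to removing the xmlns attribute with re.sub is to pass a namespace mapping to findall. This is only a sketch: it assumes the document uses the standard SVG namespace, and the sample string and the function name polyline_points are made up for illustration rather than taken from the real file.

```python
import xml.etree.ElementTree as ET

# Assumption: the document declares the standard SVG namespace.
SVG_NS = {'svg': 'http://www.w3.org/2000/svg'}

# Hypothetical sample standing in for the downloaded SVG text.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<polyline points="239,274 239,275 238,276"/>'
    '<g><polyline points="10,20 30,40"/></g>'
    '</svg>'
)

def polyline_points(svg_text):
    """Return one list of [x, y] pairs per <polyline> element."""
    root = ET.fromstring(svg_text)
    result = []
    # The 'svg:' prefix is resolved through the SVG_NS mapping.
    for poly in root.findall('.//svg:polyline', SVG_NS):
        pairs = poly.get('points', '').split()
        result.append([[int(v) for v in pair.split(',')] for pair in pairs])
    return result

parsed = polyline_points(sample)
print(parsed)
```

This keeps the document text untouched, at the cost of having to know the namespace URI in advance.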

Upvotes: 0

Israel Unterman

Reputation: 13510

I believe there is an HTML extraction package somewhere, but this is the kind of task I would do with core Python and the regular expressions module. Let txt be the <polyline ...> text you presented:

Import the regular expression module:

In [22]: import re

Performing the search:

In [24]: g = re.search('polyline points="(.*?)"', txt)

In the above regex I use polyline points=" as an anchor (I omitted the < because it is not needed for the match) and capture everything up to the next quotation mark.

The text you want is then obtained with:

In [25]: g.group(1)
Out[25]: '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280'
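re.search only returns the first match. To collect the points attribute of every polyline with the same pattern, re.findall can be used instead. A minimal sketch, where svg_text is a hypothetical two-polyline sample standing in for the full SVG text:

```python
import re

# Hypothetical sample with two polylines, standing in for the full SVG text.
svg_text = (
    '<svg><polyline points="1,2 3,4" style="fill: none"/>'
    '<polyline points="5,6 7,8" style="fill: none"/></svg>'
)

# With one capture group, findall returns the captured text of every match,
# in document order.
all_point_strings = re.findall('polyline points="(.*?)"', svg_text)
print(all_point_strings)
```

Like the single search above, this relies on points being the first attribute right after the tag name.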

Update

It's safer to use an XML parser for this data. Here is one way to do it (xml.etree is included in the standard library):

In [32]: import xml.etree.ElementTree as ET
In [33]: root = ET.fromstring(txt)

Since your data is already a single root tag, you don't need further extraction:

In [35]: root.tag
Out[35]: 'polyline'

And all the properties are actually XML attributes, converted to a dictionary:

In [37]: root.attrib
Out[37]:
{'points': '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280', 'style': 'fill: none; stroke: #000000; stroke-width: 1; stroke-linejoin: round; stroke-linecap: round; stroke-antialiasing: false; stroke-antialias: 0; opacity: 0.8'}

So here you have it:

In [38]: root.attrib['points']
Out[38]: '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280'

If you want to split this further into groups by spaces and commas, I would do this:

Get all groups separated by whitespace using split with no arguments:

>>> p = g.group(1).split()
>>> p
['239,274', '239,274', '239,274', '239,275', '239,275', '238,276', '238,276', '237,276', '237,276', '236,276', '236,276', '236,277', '236,277', '235,277', '235,277', '234,278', '234,278', '233,279', '233,279', '232,280', '232,280', '231,280', '231,280', '230,280', '230,280', '230,280', '229,280', '229,280']

Now for each string, split it at the comma which will return a list of strings. I use map to convert each such list to a list of ints:

>>> p2 = [list(map(int, numbers.split(','))) for numbers in p]
>>> p2
[[239, 274], [239, 274], [239, 274], [239, 275], [239, 275], [238, 276], [238, 276], [237, 276], [237, 276], [236, 276], [236, 276], [236, 277], [236, 277], [235, 277], [235, 277], [234, 278], [234, 278], [233, 279], [233, 279], [232, 280], [232, 280], [231, 280], [231, 280], [230, 280], [230, 280], [230, 280], [229, 280], [229, 280]]

And this will shed some more light:

>>> '123,456'.split(',')
['123', '456']
>>> list(map(int, '123,456'.split(',')))
[123, 456]
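Putting the pieces together, the following is a rough end-to-end sketch rather than a definitive implementation. The helper names fetch_svg and format_polylines are hypothetical, the sample string is made up, and the code assumes integer coordinates and UTF-8 text. It matches on the local tag name so it works whether or not the SVG declares a namespace.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def fetch_svg(url):
    """Download the SVG and return it as text (assumes UTF-8 encoding)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')

def format_polylines(svg_text):
    """Return one '[[x,y],[x,y],...]' string per <polyline> element."""
    root = ET.fromstring(svg_text)
    lines = []
    for element in root.iter():
        # Compare only the local name so namespaced and plain tags both match.
        if element.tag.rsplit('}', 1)[-1] != 'polyline':
            continue
        pairs = [list(map(int, pair.split(',')))
                 for pair in element.get('points', '').split()]
        lines.append('[' + ','.join('[%d,%d]' % (x, y) for x, y in pairs) + ']')
    return lines

# Hypothetical sample standing in for the downloaded document.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<polyline points="239,274 239,275"/>'
    '<polyline points="1,2"/>'
    '</svg>'
)
formatted = format_polylines(sample)
for line in formatted:
    print(line)

# To run against the real file (needs network access):
# for line in format_polylines(fetch_svg('http://colorillo.com/bx0l.inline.svg')):
#     print(line)
```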

Upvotes: 2
