MisterNox

Reputation: 1455

Parse an SVG-related string with regular expressions in Python

I would like to parse the commands in the d attribute of an SVG path element, and I would like to do it efficiently. I therefore decided to use a regex to avoid writing several loops.

What I want to achieve is to put each command letter together with its numeric values in a tuple and store all those tuples in a list, e.g. [('M', '3', '18'), ('h', '10'), ...]

Depending on the command letter, one to six numeric values can follow. These numeric values can contain a dot ('.45'), a minus ('-3'), or both ('-.55'). And there are not always spaces separating them, e.g. 'c -.55.45 0 1 '.

My Approach:

Here is what I tried so far. I tried to separate the tokens with the re.findall method. But after that I had to group them with an additional loop, and numeric values joined by dots are still connected. Furthermore, I would like to fold the replace call into the findall pattern.

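import re
# Just an extract of a d attribute
d = 'M20 3H4c-.55 0-1 .45-1 1v6c0 .55.45 1 1 1h16'
# Put a space before each minus, then match command letters or runs of digits, minus signs and dots
commands = re.findall(r"[mMzZlLhHvVcCsSqQtTaA]|[0-9\-.]+", d.replace("-", " -"))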

#output: ['M', '20', '3', 'H', '4', 'c', '-.55', '0', '-1', '.45', '-1', '1', 'v', '6', 'c', '0', '.55.45', '1', '1', '1', 'h', '16']

#goal: [('M', '20', '3'), ('H', '4'), ('c', '-.55', '0', '-1', '.45', '-1', '1'), ('v', '6'), ('c', '0', '.55', '.45', '1', '1', '1'), ('h', '16')]

Those dot-connected numeric values seem easy to handle: just split them on the dots. But that is not possible, because I could also have a value like '1.55', which would then be separated from its neighbor by a space ('.55 1.45'). As I had a hard time with those regex patterns, it would be awesome if someone had a solution or could at least point me in the right direction.

If I missed something or you need more information, just tell me and I will provide it. Thank you in advance!

Upvotes: 2

Views: 452

Answers (1)

Wiktor Stribiżew

Reputation: 626699

If there can be only zero to six arguments, the best you can do with a one-regex approach is to use

re.findall(r"([mMzZlLhHvVcCsSqQtTaA])(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?", d)

See the regex demo. The (?:\s*(-?\d*\.?\d+))? pattern is repeated six times to match zero to six arguments and capture each of them into its own group. (?:...)? is an optional non-capturing group. Inside it, \s* matches zero or more whitespace characters, and (-?\d*\.?\d+) captures an optional - (-?), zero or more digits (\d*), an optional dot (\.?), and one or more digits (\d+).
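To make the repetition easier to read, the same pattern can be assembled from its parts. This is only a sketch: the names NUMBER, ARG_SLOT and PATTERN are made up for illustration, and the command class is written with Q, since Q is an SVG path command.

```python
import re

# One numeric argument: optional minus, optional integer part,
# optional dot, then at least one digit.
NUMBER = r"-?\d*\.?\d+"

# One optional argument slot: optional whitespace, then a captured number.
ARG_SLOT = r"(?:\s*(" + NUMBER + r"))?"

# A command letter followed by six optional argument slots.
PATTERN = re.compile(r"([mMzZlLhHvVcCsSqQtTaA])" + ARG_SLOT * 6)

# The number sub-pattern on its own splits run-together values at the
# second dot, while keeping a value such as 1.55 intact.
print(re.findall(NUMBER, "-.55.45 0 1.55"))
# ['-.55', '.45', '0', '1.55']

# The assembled pattern captures each argument in its own group.
print(PATTERN.findall("c0 .55.45 1 1 1"))
# [('c', '0', '.55', '.45', '1', '1', '1')]
```

Because \d+ must match at least one digit after the optional dot, a second dot can never be part of the same number, which is why '.55.45' comes out as two values.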

See Python demo:

import re
d = 'M0 0h24v24H0z'
commands = re.findall(r"([mMzZlLhHvVcCsSqQtTaA])(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?(?:\s*(-?\d*\.?\d+))?", d)
# findall returns '' for optional groups that did not match; filter(None, ...) drops them
print([tuple(filter(None, x)) for x in commands])
# => [('M', '0', '0'), ('h', '24'), ('v', '24'), ('H', '0'), ('z',)]
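If a command can carry more than six values, the one-regex approach silently drops the extras. A possible alternative, sketched here under the assumption that two passes are acceptable, is to first split the string into a command letter plus its raw argument text and then extract the numbers from that text. The names COMMAND_RE, NUMBER_RE and parse_path are hypothetical helpers, not part of any library.

```python
import re

# A command letter followed by everything up to the next command letter.
COMMAND_RE = re.compile(r"([mMzZlLhHvVcCsSqQtTaA])([^mMzZlLhHvVcCsSqQtTaA]*)")

# One numeric value: optional minus, optional integer part,
# optional dot, then at least one digit.
# Assumption: no exponent notation such as 1e-5 appears in the input.
NUMBER_RE = re.compile(r"-?\d*\.?\d+")


def parse_path(d):
    """Return a list of (command, arg1, arg2, ...) tuples of strings."""
    return [
        (cmd, *NUMBER_RE.findall(args))
        for cmd, args in COMMAND_RE.findall(d)
    ]


d = 'M20 3H4c-.55 0-1 .45-1 1v6c0 .55.45 1 1 1h16'
print(parse_path(d))
# [('M', '20', '3'), ('H', '4'), ('c', '-.55', '0', '-1', '.45', '-1', '1'),
#  ('v', '6'), ('c', '0', '.55', '.45', '1', '1', '1'), ('h', '16')]
```

This version needs no filtering of empty groups and no d.replace call, because the number pattern itself treats a minus sign or a second dot as the start of a new value.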

Upvotes: 1
