evsmith

Reputation: 505

Convert, or unformat, a string to variables (like format(), but in reverse) in Python

I have strings of the form Version 1.4.0\n and Version 1.15.6\n, and I'd like a simple way of extracting the three numbers from them. I know I can put variables into a string with the format method; I basically want to do that backwards, like this:

# So I know I can do this:
x, y, z = 1, 4, 0
print('Version {0}.{1}.{2}\n'.format(x, y, z))
# Output is 'Version 1.4.0\n'

# But I'd like to be able to reverse it:

mystr='Version 1.15.6\n'
a, b, c = mystr.unformat('Version {0}.{1}.{2}\n')

# And have the result that a, b, c = 1, 15, 6

I found someone else who asked the same question, but the reply was specific to their particular case: Use Python format string in reverse for parsing

A general answer (how to do format() in reverse) would be great! An answer for my specific case would be very helpful too though.

Upvotes: 39

Views: 25207

Answers (7)

Rick

Reputation: 45281

The pypi package parse serves this purpose well:

pip install parse

Can be used like this:

>>> import parse
>>> result = parse.parse('Version {0}.{1}.{2}\n', 'Version 1.15.6\n')
>>> result
<Result ('1', '15', '6') {}>
>>> values=list(result)
>>> print(values)
['1', '15', '6']

Note that the docs say the parse package does not EXACTLY emulate the format specification mini-language by default; it also uses some type-indicators specified by re. Of special note is that s means "whitespace" by default, rather than str. This can be easily modified to be consistent with the format specification by changing the default type for s to str (using extra_types):

result = parse.parse(format_str, string, extra_types=dict(s=str))

Here is a conceptual idea for a modification of the string.Formatter built-in class using the parse package to add unformat capability that I have used myself:

import parse
from string import Formatter
class Unformatter(Formatter):
    '''A parsable formatter.'''
    def unformat(self, format, string, extra_types=dict(s=str), evaluate_result=True):
        return parse.parse(format, string, extra_types, evaluate_result)
    unformat.__doc__ = parse.Parser.parse.__doc__

IMPORTANT: the method name parse is already in use by the Formatter class, so I have chosen unformat instead to avoid conflicts.

UPDATE: You might use it like this, much as you would use the string.Formatter class.

Formatting (identical to '{:d} {:d}'.format(1, 2)):

>>> formatter = Unformatter() 
>>> s = formatter.format('{:d} {:d}', 1, 2)
>>> s
'1 2' 

Unformatting:

>>> result = formatter.unformat('{:d} {:d}', s)
>>> result
<Result (1, 2) {}>
>>> tuple(result)
(1, 2)

This is of course of very limited use as shown above. However, I've put up a pypi package (parmatter - a project originally for my own use but maybe others will find it useful) that explores some ideas of how to put this idea to more useful work. The package relies heavily on the aforementioned parse package. EDIT: a few years of experience under my belt later, I realized parmatter (my first package!) was a terrible, embarrassing idea and have since deleted it.

Upvotes: 7

nonagon

Reputation: 3483

Here's a solution in case you don't want to use the parse module. It converts format strings into regular expressions with named groups. It makes a few assumptions (described in the docstring) that were okay in my case, but may not be okay in yours.

import re

def match_format_string(format_str, s):
    """Match s against the given format string, return dict of matches.

    We assume all of the arguments in format string are named keyword arguments (i.e. no {} or
    {:0.2f}). We also assume that all chars are allowed in each keyword argument, so separators
    need to be present which aren't present in the keyword arguments (i.e. '{one}{two}' won't work
    reliably as a format string but '{one}-{two}' will if the hyphen isn't used in {one} or {two}).

    We raise if the format string does not match s.

    Example:
    fs = '{test}-{flight}-{go}'
    s = fs.format(test='first', flight='second', go='third')
    match_format_string(fs, s) -> {'test': 'first', 'flight': 'second', 'go': 'third'}
    """

    # First split on any keyword arguments, note that the names of keyword arguments will be in the
    # 1st, 3rd, ... positions in this list
    tokens = re.split(r'\{(.*?)\}', format_str)
    keywords = tokens[1::2]

    # Now replace keyword arguments with named groups matching them. We also escape between keyword
    # arguments so we support meta-characters there. Re-join tokens to form our regexp pattern
    tokens[1::2] = map(u'(?P<{}>.*)'.format, keywords)
    tokens[0::2] = map(re.escape, tokens[0::2])
    pattern = ''.join(tokens)

    # Use our pattern to match the given string, raise if it doesn't match
    matches = re.match(pattern, s)
    if not matches:
        raise Exception("Format string did not match")

    # Return a dict with all of our keywords and their values
    return {x: matches.group(x) for x in keywords}
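
To make the separator caveat from the docstring concrete, here is a small sketch using only the re module. The two patterns are written by hand and are only meant to resemble what '{one}-{two}' would be turned into; the non-greedy variant is an illustration, not part of the function above.

```python
import re

# Hand-written equivalent of the pattern built for '{one}-{two}' (greedy groups).
greedy_pattern = r'(?P<one>.*)\-(?P<two>.*)'

# The hyphen also occurs inside the data, so the split point is ambiguous.
# Greedy groups split at the LAST hyphen.
greedy = re.match(greedy_pattern, 'a-b-c').groupdict()

# Making the first group non-greedy moves the split to the FIRST hyphen.
lazy_pattern = r'(?P<one>.*?)\-(?P<two>.*)'
lazy = re.match(lazy_pattern, 'a-b-c').groupdict()

print(greedy)  # {'one': 'a-b', 'two': 'c'}
print(lazy)    # {'one': 'a', 'two': 'b-c'}
```

Neither result is "right" in general, which is why the separator should not appear inside the values.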

Upvotes: 0

DanH

Reputation: 5818

Just to build on Uche's answer, I was looking for a way to reverse a string via a pattern with kwargs. So I put together the following function:

import re

def string_to_dict(string, pattern):
    regex = re.sub(r'{(.+?)}', r'(?P<_\1>.+)', pattern)
    values = list(re.search(regex, string).groups())
    keys = re.findall(r'{(.+?)}', pattern)
    _dict = dict(zip(keys, values))
    return _dict

Which works as per:

>>> p = 'hello, my name is {name} and I am a {age} year old {what}'

>>> s = p.format(name='dan', age=33, what='developer')
>>> s
'hello, my name is dan and I am a 33 year old developer'
>>> string_to_dict(s, p)
{'age': '33', 'name': 'dan', 'what': 'developer'}

>>> s = p.format(name='cody', age=18, what='quarterback')
>>> s
'hello, my name is cody and I am a 18 year old quarterback'
>>> string_to_dict(s, p)
{'age': '18', 'name': 'cody', 'what': 'quarterback'}
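
One thing to watch for: the function above passes the literal parts of the pattern to the regex engine unescaped, so a pattern containing characters such as ( or . may not match as expected. A possible variant is sketched below. The name string_to_dict_escaped is made up for this example, and it assumes that every field is named and appears only once.

```python
import re
from string import Formatter

def string_to_dict_escaped(text, pattern):
    """Hypothetical variant: reverse a str.format pattern with named fields."""
    regex = ''
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, field, spec, conv in Formatter().parse(pattern):
        # Escape the literal text so regex metacharacters match themselves.
        regex += re.escape(literal)
        # field is None for literal text at the very end of the pattern.
        if field is not None:
            regex += '(?P<{}>.+?)'.format(field)
    match = re.fullmatch(regex, text)
    if match is None:
        raise ValueError('text does not match pattern')
    return match.groupdict()

p = '{name} is {age} years old (job: {what})'
s = p.format(name='dan', age=33, what='developer')
print(string_to_dict_escaped(s, p))
# {'name': 'dan', 'age': '33', 'what': 'developer'}
```

All values come back as strings, just as in the original function.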

Upvotes: 11

Juh_

Reputation: 15549

Some time ago I wrote the code below, which does the reverse of format but is limited to the cases I needed.

I have never tried it, but I think this is also the purpose of the parse library.

My code:

import string
import re

_def_re   = '.+'
_int_re   = '[0-9]+'
_float_re = r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'

_spec_char = r'[\^$.|?*+()'

def format_parse(text, pattern):
    """
    Scan `text` using the string.format-type `pattern`

    If `text` is not a string but iterable return a list of parsed elements

    Not every format-like pattern can be processed:
      - variable names cannot repeat (even unnamed ones, e.g. '{}_{0}')
      - alignment is not taken into account
      - only the following variable types are recognized:
           'd' look for and returns an integer
           'f' look for and returns a  float

    Examples::

        res = format_parse('the depth is -42.13', 'the {name} is {value:f}')
        print(res)
        print(type(res['value']))
        # {'name': 'depth', 'value': -42.13}
        # <class 'float'>

        print('the {name} is {value:f}'.format(**res))
        # 'the depth is -42.130000'

        # Ex2: without given variable names and with an invalid item (2nd)
        versions = ['Version 1.4.0', 'Version 3,1,6', 'Version 0.1.0']
        v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
        # v=[{0: 1, 1: 4, 2: 0}, None, {0: 0, 1: 1, 2: 0}]

    """
    # convert pattern to suitable regular expression & variable name
    v_int = 0   # available integer variable name for unnamed variable 
    cur_g = 0   # indices of current regexp group name 
    n_map = {}  # map variable name (keys) to regexp group name (values)
    v_cvt = {}  # (optional) type conversion function attached to variable name
    rpattern = '^'    # stores to regexp pattern related to format pattern        

    for txt,vname, spec, conv in string.Formatter().parse(pattern):
        # check for regexp special characters in txt (add '\' before)
        txt = ''.join(map(lambda c: '\\'+c if c in _spec_char else c, txt))
        rpattern += txt

        # literal text at the end of the pattern has no variable attached
        if vname is None:
            continue

        # process variable name
        if len(vname)==0:
            vname = v_int
            v_int += 1
        if vname not in n_map:
            gname = '_'+str(cur_g)
            n_map[vname] = gname
            cur_g += 1
        else:
            gname = n_map[vname]

        # process type of required variables
        if   'd' in spec: vtype = _int_re;   v_cvt[vname] = int
        elif 'f' in spec: vtype = _float_re; v_cvt[vname] = float
        else:             vtype = _def_re

        rpattern += '(?P<'+gname+'>' + vtype +')'

    rpattern += '$'

    # replace dictionary key from regexp group-name to the variable-name 
    def map_result(match):
        if match is None: return None
        match = match.groupdict()
        match = dict((vname, match[gname]) for vname,gname in n_map.items())
        for vname, value in match.items():
            if vname in v_cvt:
                match[vname] = v_cvt[vname](value)
        return match

    # parse pattern
    if isinstance(text, str):
        match = re.search(rpattern, text)
        match = map_result(match)
    else:
        comp  = re.compile(rpattern)
        match = [map_result(comp.search(t)) for t in text]

    return match

For your case, here is a usage example:

versions = ['Version 1.4.0', 'Version 3.1.6', 'Version 0.1.0']
v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
# v=[{0: 1, 1: 4, 2: 0}, {0: 3, 1: 1, 2: 6}, {0: 0, 1: 1, 2: 0}]

# to get the versions as a list of integer list, you can use:
v = [[vi[i] for i in range(3)] for vi in filter(None,v)]

Note the filter(None,v) to remove unparsable versions (which return None). Here it is not necessary.

Upvotes: 3

Levon

Reputation: 143102

This

a, b, c = (int(i) for i in mystr.split()[1].split('.'))

will give you int values for a, b and c

>>> a
1
>>> b
15
>>> c
6

Depending on how consistent your version strings are, you may want to consider regular expressions. If they will always stay in this format, though, I would favor the simpler solution if it works for you.
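
If the input might be irregular, a regex-based version could look roughly like the sketch below. The helper name parse_version and the choice to raise ValueError on a mismatch are assumptions made for this example; only the standard re module is used.

```python
import re

# The whole line must look like 'Version X.Y.Z', optionally followed by whitespace.
VERSION_RE = re.compile(r'Version (\d+)\.(\d+)\.(\d+)\s*')

def parse_version(line):
    """Hypothetical helper: return (major, minor, patch) as ints."""
    match = VERSION_RE.fullmatch(line)
    if match is None:
        raise ValueError('not a version line: {!r}'.format(line))
    return tuple(int(part) for part in match.groups())

a, b, c = parse_version('Version 1.15.6\n')
print(a, b, c)  # 1 15 6
```

Unlike the split-based one-liner, this rejects a line such as 'Version 3,1,6' with a clear error instead of failing somewhere inside int().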

Upvotes: 2

Uche

Reputation: 80

Actually, the Python regular expression library already provides the general functionality you are asking for. You just have to change the syntax of the pattern slightly:

>>> import re
>>> from operator import itemgetter
>>> mystr='Version 1.15.6\n'
>>> m = re.match(r'Version (?P<_0>.+)\.(?P<_1>.+)\.(?P<_2>.+)', mystr)
>>> list(map(itemgetter(1), sorted(m.groupdict().items())))
['1', '15', '6']

As you can see, you have to change the (un)format strings from {0} to (?P<_0>.+). You could even require a decimal number with (?P<_0>\d+). In addition, you have to escape some of the characters to prevent them from being interpreted as regex special characters. But this in turn can be automated, e.g. with

>>> re.sub(r'\\{(\d+)\\}', r'(?P<_\1>.+)', re.escape('Version {0}.{1}.{2}'))
'Version\\ (?P<_0>.+)\\.(?P<_1>.+)\\.(?P<_2>.+)'
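
Putting the two steps together, the conversion can be wrapped in a small function. This is only a sketch: the name unformat and its group parameter are invented for this example, and it assumes numbered fields such as {0} that each appear once. It orders the results by field number rather than by group name, so that {10} would sort after {2}.

```python
import re

def unformat(fmt, text, group=r'.+'):
    """Hypothetical helper: reverse str.format for numbered fields like {0}."""
    # Escape the literal text, then turn each escaped \{N\} into a named group.
    # A function is used as the replacement so that `group` is inserted verbatim.
    regex = re.sub(r'\\{(\d+)\\}',
                   lambda m: '(?P<_%s>%s)' % (m.group(1), group),
                   re.escape(fmt))
    match = re.fullmatch(regex, text)
    if match is None:
        raise ValueError('text does not match format')
    found = match.groupdict()
    # Sort by the numeric part of the group name ('_0', '_1', ...).
    return tuple(found[key] for key in sorted(found, key=lambda k: int(k[1:])))

a, b, c = unformat('Version {0}.{1}.{2}\n', 'Version 1.15.6\n', group=r'\d+')
print(a, b, c)  # 1 15 6
```

The values are returned as strings; wrap them in int() if you need numbers.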

Upvotes: 4

Willian

Reputation: 2445

>>> import re
>>> re.findall(r'(\d+)\.(\d+)\.(\d+)', 'Version 1.15.6\n')
[('1', '15', '6')]

Upvotes: 9
