Parameterized string formats yield unexpected results

Question

I have a dataframe that contains a data dictionary for a microdataset, including field width of the string fields, where those fields are zero padded.

I ultimately want to use that dataframe to create a converter dictionary to pass to the pd.read_csv call, where each variable's converter is a lambda wrapping a string formatter whose width is parameterized and varies per variable.

In other words, I want to generate a dictionary of functions, each with their own string format templates so that each variable can be loaded downstream with the appropriate zero padding.

To do this I iterate over the rows and use the variable denoting the width to create the string formatter expression, with variable width. This seems to work.
I store this formatter in a dictionary with an entry for each row.

However, when I subsequently look up a key in the dictionary and pass an argument, it pads to length four regardless of what the width parameter was for that key.

Example:

import pandas as pd

# dict for storing the mapping
coll={}

# mock data (var name and associated width)
df=pd.DataFrame(data={'nme':['a','b','c','d'],'width':[2,2,3,4]})

# iterate rows
for _,dta in df.iterrows():

    # create variable width format string from width variable
    # mix of old / new string format approach

    formatstring = ('{:0>%s}'%dta.width)

    # turn string into a function, with string to be padded as argument

    formatfunc = lambda x: formatstring.format(x)
    coll[dta.nme]=formatfunc

    print('var {}; width {}'.format(dta.nme, dta.width))
    print(formatstring)

The output while running is as follows. Notably, the format string looks right, with variable width:

var a; width 2
{:0>2}
var b; width 2
{:0>2}
var c; width 3
{:0>3}
var d; width 4
{:0>4}

But when I key an entry in the coll dictionary, I invariably get a padding to length 4. What did I miss, and is this a practical approach?

coll['a'](3)
'0003'

Here I expected a padded string with length 2 for the key a. Instead I get length 4.

Qusai Alothman · Accepted Answer

That's because your lambda looks up the global variable formatstring when it is called, not when it is defined. By the time you call it, formatstring equals {:0>4}, the value assigned in the last iteration.

Another simpler example:

y = 5
f = lambda x: print(x+y)
f(2) # prints 7
y = 10
f(2) # prints 12

How to solve this

One way to solve this is to get rid of the lambdas altogether and look the width up at the moment you need it. A hacky example:

df.set_index('nme',inplace=True)
coll = df.to_dict(orient='index')

str(3).zfill(coll['a']['width'])  # evaluates to '03'

You can convert the last line to a function (or a lambda) if you want.
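If you would rather keep a dictionary of per-variable functions, as in the question, the width has to be bound when each function is created rather than looked up when it is called. Below is a minimal sketch of two common ways to do that. The names make_padder, widths and coll_lambda are made up for illustration, and the plain widths dict stands in for the width column of the dataframe.

```python
def make_padder(width):
    """Return a function that left-pads its argument with zeros to `width`."""
    def pad(value):
        # `width` is the factory's own local, so every returned function
        # keeps the value it was created with.
        return '{:0>{w}}'.format(value, w=width)
    return pad

# Stand-in for the name/width pairs held in the dataframe (illustrative data).
widths = {'a': 2, 'b': 2, 'c': 3, 'd': 4}

# Option 1: a factory function, one closure per variable.
coll = {name: make_padder(w) for name, w in widths.items()}

# Option 2: keep the lambda, but freeze the width as a default argument,
# which is evaluated once, when the lambda is created.
coll_lambda = {name: (lambda x, w=w: '{:0>{w}}'.format(x, w=w))
               for name, w in widths.items()}

print(coll['a'](3))         # 03
print(coll['c']('7'))       # 007
print(coll_lambda['d'](3))  # 0003
```

Either dictionary maps each variable name to a one-argument callable, which is the shape the converters argument of pd.read_csv expects.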
