Reputation: 3689
I have a dataframe that contains a data dictionary for a microdataset, including field width of the string fields, where those fields are zero padded.
Ultimately I want to use that dataframe to build a converters dictionary to pass to the pd.read_csv call, where each variable's converter is a lambda wrapping a string formatter whose width is specific to that variable.
In other words, I want to generate a dictionary of functions, each with their own string format templates so that each variable can be loaded downstream with the appropriate zero padding.
To do this I iterate over the rows and use the width variable to build the string formatter expression for each row. This seems to work.
I store this formatter in a dictionary with an entry for each row.
However, when I later look up an entry in the dictionary and call it with an argument, it pads to length four regardless of the width stored for that variable.
Example:
import pandas as pd

# dict for storing the mapping
coll = {}
# mock data (var name and associated width)
df = pd.DataFrame(data={'nme': ['a', 'b', 'c', 'd'], 'width': [2, 2, 3, 4]})
# iterate rows
for _, dta in df.iterrows():
    # create variable width format string from width variable
    # mix of old / new string format approach
    formatstring = '{:0>%s}' % dta.width
    # turn string into a function, with string to be padded as argument
    formatfunc = lambda x: formatstring.format(x)
    coll[dta.nme] = formatfunc
    print('var {}; width {}'.format(dta.nme, dta.width))
    print(formatstring)
The output is as follows. Notably, the format string looks correct, with the width varying as expected:
var a; width 2
{:0>2}
var b; width 2
{:0>2}
var c; width 3
{:0>3}
var d; width 4
{:0>4}
But when I look up an entry in the coll dictionary and call it, I invariably get padding to length 4. What did I miss, and is this a practical approach?
coll['a'](3)
'0003'
Here I expected a padded string of length 2 for the key a. Instead I get length 4.
Upvotes: 1
Views: 98
Reputation: 2072
That's because your lambda looks up the global variable formatstring when it is called, not when it is defined. By the time you call it, formatstring equals {:0>4}, the value assigned in the last iteration.
Here is a simpler example of the same behaviour:
y = 5
f = lambda x: print(x+y)
f(2) # prints 7
y = 10
f(2) # prints 12
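If you would rather keep a lambda, a common idiom is to bind the current value as a default argument, since defaults are evaluated once when the lambda is defined. A minimal sketch with the same toy variables (the names f_late and f_bound are only illustrative):

```python
y = 5
f_late = lambda x: x + y         # y is looked up each time the lambda is called
f_bound = lambda x, y=y: x + y   # default argument stores y's value at definition (5)

y = 10
late_result = f_late(2)    # uses the current global y
bound_result = f_bound(2)  # still uses the y captured at definition
```

In your loop, the equivalent would be something like lambda x, fmt=formatstring: fmt.format(x), so each lambda keeps its own format string.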
One way to solve this is to get rid of the lambdas altogether. A hacky example:
df.set_index('nme',inplace=True)
coll = df.to_dict(orient='index')
str(3).zfill(coll['a']['width']) # evaluates to '03'
You can convert the last line to a function (or a lambda) if you want.
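For instance, here is a rough sketch of turning that into one function per variable. It assumes the widths are already in a plain dict shaped like the to_dict(orient='index') result above; make_padder, widths and converters are hypothetical names, not anything from pandas:

```python
def make_padder(width):
    """Return a function that left-pads its argument with zeros to `width`."""
    def pad(value):
        # width is a parameter of make_padder, so each pad() keeps its own value
        return str(value).zfill(width)
    return pad

# Assumed input: same shape as df.set_index('nme').to_dict(orient='index')
widths = {'a': {'width': 2}, 'b': {'width': 2}, 'c': {'width': 3}, 'd': {'width': 4}}

# One independent padding function per variable name
converters = {name: make_padder(info['width']) for name, info in widths.items()}

padded_a = converters['a'](3)
padded_c = converters['c']('7')
padded_d = converters['d'](12)
```

Because the width is passed into make_padder as an argument, every function gets its own copy and the late-binding problem does not come up. A dict mapping column names to functions like this is the shape the converters argument of pd.read_csv expects, so it should be usable there directly.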
Upvotes: 1