Reputation: 2441
I have a list of values, all strings, and I want to convert them to their respective datatypes. The mapping from each value's position to its type is available.
There are three datatypes: int, str, and datetime. The code also needs to handle error cases in the data.
I am doing something like this:
tlist = [ 'some datetime value', '12', 'string', .... ]
#convert it to: [ datetime object, 12, 'string', ....]
error_data = ['', ' ', '?', ...]
d = { 0: lambda x: datetime.strptime(x,...) if x not in error_data else x,
      1: lambda x: int(x) if x not in error_data else 0,
      2: lambda x: x,
      ...
    }
result = [ d[i](j) for i, j in enumerate(tlist) ]
The list to convert is very long (around 180 values), and I need to do this for thousands of such lists. The performance of the above code is very poor. What is the fastest way to do it?
Thank you
Upvotes: 6
Views: 3095
Reputation: 2441
Thank you all for those approaches. I tried pretty much every approach mentioned, but none of them performed well enough.
The following approach worked well for my performance needs. This is what I did.
First, I substituted 0 for all the int error values, with code like:
l[i] = value if value != '' else 0
Then, instead of coercing value by value through a dictionary, I coerced the whole list at once:
def coerce(l):
    return [ l[0], int(l[1]), int(l[2]), ... ]
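Roughly, a sketch of the idea (the column positions, the strptime format, and the name rows below are just illustrative, not my actual data):
from datetime import datetime

INT_POSITIONS = (1, 2)  # illustrative: which columns hold ints

def clean(row):
    # substitute 0 for error values in the int columns up front
    for i in INT_POSITIONS:
        if row[i] == '':
            row[i] = 0
    return row

def coerce(row):
    # one flat expression per row instead of a per-value dictionary lookup
    return [datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S'),
            int(row[1]),
            int(row[2])]

results = [coerce(clean(row)) for row in rows]  # rows: the thousands of input lists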
Upvotes: 0
Reputation: 27575
There's an inconsistency in your code:
if all elements in the list are strings, you can't write datetime(x)
with x being a string.
As written, the snippet doesn't demonstrate anything, because it is inconsistent. The complexity of what is missing from your code doesn't justify the oddities that are in it. As long as you don't explain how a string can be passed as an argument to datetime.datetime(), nobody will be able to help you, IMO.
I think it's better to build your typed list directly at the moment the file is read.
I wrote an example:
First, I created a CSV file with the following code:
import csv
from random import randint, choice
from time import gmtime

xx = ['Whose', 'all', 'birth', 'just', 'infant', 'William',
      'dearest', 'rooms', 'find', 'Deserts', 'saucy', 'His',
      'how', 'considerate', 'only', 'other', 'Houses', 'has',
      'Fanny', 'them', 'his', 'very', 'dispense', 'early',
      'words', 'not', 'thus', 'now', 'pettish', 'Worth']

def gen(n):
    # yield rows of three kinds in rotation: a datetime row, an int row, a word row
    for i in xrange(n):
        yield ['AAAA', '%d/%02d/%02d %02d:%02d:%02d' % gmtime(randint(0, 80000000))[0:6], '@@@']
        yield ['BBBB', randint(100, 999), '^^^^^^']
        yield ['CCCC', choice(xx), '-----------------']

with open('zzz.txt', 'wb') as f:
    writ = csv.writer(f, delimiter='#')
    writ.writerows(x for x in gen(60))
The structure of the CSV file looks like this:
AAAA#1972/02/11 08:53:53#@@@
BBBB#557#^^^^^^
CCCC#dearest#-----------------
AAAA#1971/10/15 06:55:20#@@@
BBBB#668#^^^^^^
CCCC#?#-----------------
AAAA#1972/07/13 11:10:05#@@@
BBBB#190#^^^^^^
CCCC#infant#-----------------
AAAA#1971/11/22 19:31:42#@@@
BBBB#202#^^^^^^
CCCC##-----------------
AAAA#1971/06/12 05:48:39#@@@
BBBB#81#^^^^^^
CCCC#find#-----------------
AAAA#1970/12/09 06:26:29#@@@
BBBB#72#^^^^^^
CCCC#find#-----------------
AAAA#1972/07/05 10:45:32#@@@
BBBB#270#^^^^^^
CCCC#rooms#-----------------
AAAA#1972/06/23 05:52:20#@@@
BBBB#202#^^^^^^
CCCC##-----------------
AAAA#1972/03/21 23:06:47#@@@
BBBB#883#^^^^^^
CCCC#William#-----------------
...... etc
The following code extracts the data in a way similar to what you want.
There is no need for a dictionary; a tuple is sufficient. Given the structure of the CSV file created, I defined funcs = 60 * (to_dt, int, lambda y: y),
but you would use the succession of functions given by your dictionary's values (sorted by key).
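For instance, a possible sketch of getting that succession out of the question's dictionary, assuming its keys are the 0-based positions:
# build the tuple of converter functions from the position -> function dict
funcs = tuple(d[k] for k in sorted(d))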
import re
import csv
from datetime import datetime
from itertools import izip

reg = re.compile(r'(\d{4})/(\d\d)/(\d\d) (\d\d):(\d\d):(\d\d)')

def to_dt(x, error_data=('', ' ', '?')):
    # parse 'YYYY/MM/DD hh:mm:ss' into a datetime, passing error markers through unchanged
    if x in error_data:
        return x
    else:
        return datetime(*map(int, reg.match(x).groups()))

def teger(x, error_data=('', ' ', '?')):
    # int conversion that maps error markers to 0
    # (unused below, because the generated file happens to contain no bad ints)
    if x in error_data:
        return 0
    else:
        return int(x)

funcs = 60 * (to_dt, int, lambda y: y)

with open('zzz.txt', 'rb') as f:
    rid = csv.reader(f, delimiter='#')
    li = [fct(x[1]) for fct, x in izip(funcs, rid)]

# display
it = (str(el) for el in li).next
print '\n'.join('%-21s %4s %10s' % (it(), it(), it()) for i in xrange(60))
Result:
1972-02-11 08:53:53 557 dearest
1971-10-15 06:55:20 668 ?
1972-07-13 11:10:05 190 infant
1971-11-22 19:31:42 202
1971-06-12 05:48:39 81 find
1970-12-09 06:26:29 72 find
1972-07-05 10:45:32 270 rooms
1972-06-23 05:52:20 202
1972-03-21 23:06:47 883 William
1970-02-08 23:47:26 617
1970-10-08 09:09:33 387 William
1971-04-30 11:05:07 721 ?
1970-02-12 11:57:48 827 Deserts
1972-03-27 21:30:39 363 just
1971-06-02 00:23:52 977
1970-04-20 04:38:38 113 William
1971-01-20 23:10:26 75 Whose
1971-07-01 12:46:13 352 dearest
1971-01-31 17:01:34 220 William
1970-06-09 20:38:52 148 rooms
1971-08-08 07:42:10 146
1970-01-28 15:17:41 903 find
...............etc
Upvotes: 1
Reputation: 27216
Something like:
>>> import itertools
>>> values = ["12", "a", "bcd", "2.2"]
>>> types = [int, int, str, float]
>>> defaults = {int: 0, float: 0.0}
>>> res = []
>>> for v, f in itertools.izip(values, types):  # just use zip on Python 3+
...     try:
...         res.append(f(v))
...     except ValueError:
...         res.append(defaults[f])
...
>>> print(res)
[12, 0, 'bcd', 2.2]
Edit:
This doesn't handle datetime values. My suggestion for those is to use str
in types, and convert to datetime after the loop, like:
res[0] = datetime.strptime(res[0], "...")
Both getting and setting a list item are O(1), so this shouldn't be a problem.
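A minimal sketch of that post-loop step, assuming column 0 holds the datetime string and the format is known (the format string here is illustrative):
from datetime import datetime

try:
    # column 0 was kept as str inside the loop; convert it afterwards
    res[0] = datetime.strptime(res[0], "%Y-%m-%d %H:%M:%S")
except ValueError:
    pass  # leave error markers such as '', ' ', '?' unchanged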
Upvotes: 2
Reputation: 15944
As a variant on utdemir's answer, if error values are fairly infrequent, then you could optimize the common case:
>>> values = ["12", "a", "bcd", "2.2"]
>>> types = [int, int, str, float]
>>> defaults = {int: 0, float: 0.0}
>>> try:
...     res = [f(v) for v, f in zip(values, types)]
... except ValueError:
...     res = []
...     for v, f in zip(values, types):
...         try:
...             res.append(f(v))
...         except ValueError:
...             res.append(defaults[f])
I.e., first try converting the whole line assuming that nothing will go wrong. If anything does go wrong, then go back and convert values one at a time, fixing any error values.
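Applied over many rows, that could look like this sketch (convert_row and rows are illustrative names, not part of the original answer):
def convert_row(values, types, defaults):
    try:
        # fast path: assume the whole row converts cleanly
        return [f(v) for v, f in zip(values, types)]
    except ValueError:
        # slow path: only reached for rows that contain error values
        res = []
        for v, f in zip(values, types):
            try:
                res.append(f(v))
            except ValueError:
                res.append(defaults[f])
        return res

results = [convert_row(row, types, defaults) for row in rows]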
Upvotes: 0
Reputation: 16455
If your datetime values are always in a consistent format, why not let the type conversion itself handle the invalid data that you're trying to manage in error_data? This is not as sexy as some solutions, but it makes type conversion based on the position of the data in the list a little easier to maintain and extend.
from datetime import datetime

def convert(position, val):
    if position == 0:
        try:
            return datetime.strptime(val, '%Y-%m-%d %H:%M:%S')  # assuming the date is in a constant format
        except ValueError:
            return val
    elif position in (1, 15, 16):  # assuming you have other int values in other "columns"
        try:
            return int(val)
        except ValueError:
            return 0
    else:  # string type
        return val

result = [convert(i, j) for i, j in enumerate(tlist)]
Upvotes: 2
Reputation: 375624
I don't know if this would be much faster, but to me it's clearer:
tlist = [ 'some datetime value', '12', 'string', .... ]
#convert it to: [ datetime object, 12, 'string', ....]
error_data = set(['', ' ', '?', ...])

def s(x):
    return x

def d(x):
    return datetime.strptime(x, ...) if x not in error_data else x

def i(x):
    return int(x) if x not in error_data else 0

types = [ d, i, s, s, s, i, i, d, i, ... ]
result = [ t(x) for t, x in zip(types, tlist) ]
As others have mentioned, I'm using a set for the error values, which will be faster than the list you had.
Upvotes: 0
Reputation: 9173
Since you know the types you want to convert to, you probably won't get a performance boost from trying to optimize the conversions themselves. The poor performance probably comes from repeatedly iterating over error_data. If it is possible, rebuild your error_data list as a set to exploit that type's constant-time membership tests:
error_set = set(error_data)
Then proceed as you have been. Further improvements would require profiling your code to actually determine where the time is being spent.
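A minimal sketch of what "proceed as you have been" could look like with the set in place (the strptime format string is illustrative, not taken from the question):
from datetime import datetime

d = {
    0: lambda x: x if x in error_set else datetime.strptime(x, '%Y-%m-%d %H:%M:%S'),
    1: lambda x: 0 if x in error_set else int(x),
    2: lambda x: x,
}
result = [d[i](v) for i, v in enumerate(tlist)]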
Upvotes: 1