Sujit
Sujit

Reputation: 2441

Fastest way to cast values to their respective datatypes in Python

I have a list of values - all strings. I want to convert these values to their respective datatypes. I have mapping of values to the types information available.

There are three different datatypes: int, str, datetime. The code needs to be able to handle the error cases with the data.

I am doing something like:-

tlist =  [ 'some datetime value', '12', 'string', .... ]

#convert it to: [ datetime object, 12, 'string', ....]

error_data = ['', ' ', '?', ...]

d = { 0: lambda x: datetime.strptime(x,...) if x not in error_data else x, 
      1: lambda x: int(x) if x not in error_data else 0,
      2: lambda x: x 
      ...
     }

result = [ d[i](j) for i, j in enumerate(tlist) ]

The list to convert is very long, like 180 values and I need to do it for thousands of such lists. The performance of above code is very poor. What is the fastest way to do it?

Thank you

Upvotes: 6

Views: 3095

Answers (7)

Sujit
Sujit

Reputation: 2441

Thank you guys for all those approaches. Yeah, I tried pretty much all the approaches mentioned, but none did perform well.

I tried the following approach and it worked pretty well for my performance needs. This is what I did.

  1. I handled the datetime values as suggested by utdemir
  2. I've inserted value 0 for all the int error values with code like -

    [i] = value if value != '' else 0

  3. Instead of coercing value by value using dictionary, I coerced all the value at once using a list.

    def coerce(l):
    return [ l[0], int(l[1]), int(l[2]) ... ]

My Observations:

  1. handling the error cases separately and coercing at once helps a lot.
  2. letting the code recover from an exception takes a lot of time.
  3. Going over a list multiple times doesn't really hurt( I did it for handling error cases)
  4. To give an idea, for couple of hours of data (0.60 million records), my approach brought down the runtime from 6-7 minutes to 2m30s

Upvotes: 0

eyquem
eyquem

Reputation: 27575

There's an incongruity in your code:

if all elements in a list are strings, you can't write datetime(x) with x being a string

edit

It depicts nothing since it is incongruous. The complexity of what is not in you code doesn't justify the weirdness that is in your code. As long as you won't explain how you can pass a string as argument to the function datetime.datetime(), nobody will be able to help you, IMO.

Edit

I think that it's better to create directly your list at the moment the file is read.

I wrote an example:

.

First, I created a CSV file with the following code:

import csv
from random import randint,choice
from time import gmtime


xx = ['Whose', 'all', 'birth', 'just', 'infant', 'William',
      'dearest', 'rooms', 'find', 'Deserts', 'saucy', 'His',
      'how', 'considerate', 'only', 'other', 'Houses', 'has',
      'Fanny', 'them', 'his', 'very', 'dispense', 'early',
      'words', 'not', 'thus', 'now', 'pettish', 'Worth']

def gen(n):
    for i in xrange(n):
        yield ['AAAA','%d/%02d/%02d %02d:%02d:%02d' % gmtime(randint(0,80000000))[0:6],'@@@']
        yield ['BBBB',randint(100,999),'^^^^^^']
        yield ['CCCC',choice(xx),'-----------------']

with open('zzz.txt','wb') as f:
    writ = csv.writer(f, delimiter='#')
    writ.writerows(x for x in gen(60))

The structure of the CSV file is so:

AAAA#1972/02/11 08:53:53#@@@
BBBB#557#^^^^^^
CCCC#dearest#-----------------
AAAA#1971/10/15 06:55:20#@@@
BBBB#668#^^^^^^
CCCC#?#-----------------
AAAA#1972/07/13 11:10:05#@@@
BBBB#190#^^^^^^
CCCC#infant#-----------------
AAAA#1971/11/22 19:31:42#@@@
BBBB#202#^^^^^^
CCCC##-----------------
AAAA#1971/06/12 05:48:39#@@@
BBBB#81#^^^^^^
CCCC#find#-----------------
AAAA#1970/12/09 06:26:29#@@@
BBBB#72#^^^^^^
CCCC#find#-----------------
AAAA#1972/07/05 10:45:32#@@@
BBBB#270#^^^^^^
CCCC#rooms#-----------------
AAAA#1972/06/23 05:52:20#@@@
BBBB#202#^^^^^^
CCCC##-----------------
AAAA#1972/03/21 23:06:47#@@@
BBBB#883#^^^^^^
CCCC#William#-----------------
...... etc

.

The following code extracts the data in a similar way to what you want.

There is no need of a dictionary, a tuple is sufficient. Given the structure of the CSV file created, I defined funcs = 60 * (to_dt, int, lambda x: x) but you'll use the succession of functions that is your dictionary's values (sorted)

import re
import csv
from datetime import datetime
from itertools import izip

reg = re.compile('(\d{4})/(\d\d)/(\d\d) (\d\d):(\d\d):(\d\d)')

def to_dt(x, error_data = ('', ' ', '?')):
    if x in error_data:
        return x
    else:
        return datetime(*map(int,reg.match(x).groups()))

def teger(x,  error_data = ('', ' ', '?')):
    if x in error_data:
        return 0
    else:
        return int(x)

funcs = 60 * (to_dt, int, lambda y: y)

with open('zzz.txt','rb') as f:
    rid = csv.reader(f, delimiter='#')
    li = [fct(x[1]) for fct,x in izip(funcs,rid)]



# display
it = (str(el) for el in li).next
print '\n'.join('%-21s %4s  %10s' % (it(),it(),it()) for i in xrange(60))

result

1972-02-11 08:53:53    557     dearest
1971-10-15 06:55:20    668           ?
1972-07-13 11:10:05    190      infant
1971-11-22 19:31:42    202            
1971-06-12 05:48:39     81        find
1970-12-09 06:26:29     72        find
1972-07-05 10:45:32    270       rooms
1972-06-23 05:52:20    202            
1972-03-21 23:06:47    883     William
1970-02-08 23:47:26    617            
1970-10-08 09:09:33    387     William
1971-04-30 11:05:07    721           ?
1970-02-12 11:57:48    827     Deserts
1972-03-27 21:30:39    363        just
1971-06-02 00:23:52    977            
1970-04-20 04:38:38    113     William
1971-01-20 23:10:26     75       Whose
1971-07-01 12:46:13    352     dearest
1971-01-31 17:01:34    220     William
1970-06-09 20:38:52    148       rooms
1971-08-08 07:42:10    146            
1970-01-28 15:17:41    903        find
...............etc

Upvotes: 1

utdemir
utdemir

Reputation: 27216

Something like:

>>> values = ["12", "a", "bcd", "2.2"]
>>> types = [int, int, str, float]
>>> defaults = {int: 0, float: 0.0}
>>> res = []
>>> for v, f in itertools.izip(values, types): #Just use zip for Python 3+.
    try:
        res.append(f(v))
    except ValueError:
        res.append(defaults[f])
>>> print(res)
[12, 0, 'bcd', 2.2]

Edit:

This doesn't handle datetime values. My solution for that is use str for that, and convert to datetime after the loop, like:

res[0] = datetime.strptime(res[0], "...")

Both getting and setting the list item has O(1) complexity, so it shouldn't be a problem.

Upvotes: 2

Edward Loper
Edward Loper

Reputation: 15944

As a variant on utdemir's answer, if error values are fairly infrequent, then you could optimize the common case:

>>> values = ["12", "a", "bcd", "2.2"]
>>> types = [int, int, str, float]
>>> defaults = {int: 0, float: 0.0}
>>> try: res = [f(v) for v,f in zip(values,types)]
... except: 
...     res = []
...     for v, f in zip(values, types):
...        try:
...             res.append(f(v))
...         except ValueError:
...             res.append(defaults[f])

I.e., first try converting the whole line assuming that nothing will go wrong. If anything does go wrong, then go back and convert values one at a time, fixing any error values.

Upvotes: 0

Philip Southam
Philip Southam

Reputation: 16455

If your datetime value is always consistant why not let the type casting handle the invalid data that you're trying to manage in error_data. This is not as sexy as some solutions but makes managing type conversion based on position of data in list a little easier to maintain and expand upon.

def convert(position, val):
    if position == 0:
        try:
            return datetime.strptime(val, '%Y-%m-%d %H:%M:%S') # assuming date is in a constant format
        except ValueError:
            return val
    elif position in (1, 15, 16): # assuming that you have other int values in other "columns"
        try:
            return int(val)
        except ValueError:
            return 0
    else: # string type
       return val

result = [convert(i,j) for i, j in enumerate(tlist)]

Upvotes: 2

Ned Batchelder
Ned Batchelder

Reputation: 375624

I don't know if this would be much faster, but to me it's clearer:

tlist =  [ 'some datetime value', '12', 'string', .... ]

#convert it to: [ datetime object, 12, 'string', ....]

error_data = set(['', ' ', '?', ...])

def s(x):
    return x

def d(x):
    return datetime(x) if x not in error_data else x

def i(x):
    return int(x) if x not in error_data else 0

types = [ d, i, s, s, s, i, i, d, i, ... ]

result = [ t(x) for t, x in zip(types, tlist) ]

As others have mentioned, I'm using a set for the error values, which will be faster than the list you had.

Upvotes: 0

multipleinterfaces
multipleinterfaces

Reputation: 9173

Since you know the types you want to convert to, you probably won't get a performance boost from trying to optimize your conversions. The poor performance probably comes from repeatedly iterating over error_data. If it is possible, reconstruct your error_data list as a set to exploit nature of that type:

error_set = set((err, None) for err in error_data)

Then proceed as you have been. Further improvements would require profiling your code to actually determine where time is being spent.

Upvotes: 1

Related Questions