László
László

Reputation: 4144

force datetime conversion, coerce datetime dtype, with read_table in pandas

My data has two fields for dates in days, but among the tens of millions of lines, there are errors in some rows. I'd rather just convert the columns into datetime right upon reading in the data, for speed and memory constraints, yet then I cannot seem to coerce the process to just leave NaT for the invalid strings instead of raising an error. I am able to read in the fields as uint32 for later coerced conversion to datetime, yet that seems unnecessarily slow once we have the parse_dates option.

The sample data follows at the bottom of the post.

This approach works beautifully as longs as the data is always well formatted: pd.read_table(filename,usecols=[0,8,9,11],parse_dates=[1,2])

However, this results in dtype object if some rows cannot be converted, and then later operations break.

pd.read_table(filename,usecols=[0,8,9,11],parse_dates=[1,2],dtype={'LopNr': np.uint32,'INDATUMA': np.uint32,'UTDATUMA': np.uint32,'DIAGNOS': np.object}), assuming the dtype would apply to the data before it enters the converter, hiccups on a string in some of the rows: ValueError: invalid literal for long() with base 10: 'string'

pd.read_table(filename,usecols=[0,8,9,11],parse_dates=[1,2],dtype={'LopNr': np.uint32,'INDATUMA': 'datetime64','UTDATUMA': 'datetime64','DIAGNOS': np.object}) works neither, as TypeError: the dtype <M8 is not supported for parsing, pass this column using parse_dates instead

What is an efficient way to proceed with this kind of data if I am ready to dismiss the rows whose string format is not parseable as a date?

The sample data (with all meaningful digits replaced with . for confidentiality) follows, note that two dates are misformatted in the last two rows:

LopNr   AR  KON ALDER   LKF SJUKHUS MVO LT_KLIN INDATUMA    UTDATUMA    HDIA    DIAGNOS OP  PVARD   INSATT  UTSATT  VTID    EKOD.   EKOD.   ICD PEKARE  OPD.    OPD.    OPD.    OPD.    OPD.    OPD.    OPD.    OPD.    OPD.    OPD..   OPD..   OPD..
..  ....    .   .   ......  .....   ... ... 19970320    19970320    S...    S...        .   .   .   .   W....       ..              
..  ....    .   .   ......  .....   ... ... 19970306    19970307    S...    S...        .   .   .   .   W....       ..              
..  ....    .   ..  ......  .....   ... ... 19961219    19970208    Z...    Z... S...       .   .   .   ..  W....       ..          
..  ....    .   ..  ......  .....   ... ... 19970208    19970320    Z...    Z... S...       .   .   .   ..  W....       ..          
..  ....    .   ..  ......  .....   ... ... 19970604    19970611    I...    I... I... I...      .   .   .   .           ..          
..  ....    .   ..  ......  .....   ... ... 19970402    19970406    O800A   O800A       .   .   .   .           ..              
..  ....    .   ..  ......  .....   ... ... 19970412    19970415    R...    R... I... Z... J...     .   .   .   .           ..      
..  ....    .   ..  ......  .....   ... ... 19970520    19970523    R...    R... I... J...  V....   .   .   .   .           ..  .       
..  ....    .   ..  ......  .....   ... ... 19970504    19970507    I...    I...        .   .   .   .           ..              
..  ....    .   ..  ......  .....   ... ... 1997050     19970507    I...    I...        .   .   .   .           ..              
..  ....    .   ..  ......  .....   ... ... 19970504    string      I...    I...        .   .   .   .           ..              

Upvotes: 1

Views: 5928

Answers (1)

chrisb
chrisb

Reputation: 52276

You could do this:

parser = lambda x : pd.to_datetime(x, coerce=True)
pd.read_table(..., parse_dates=[0,1], date_parser=parser)

But this isn't any faster than just reading in and then parsing. Even in the default case, parsing happens after the data is read in.

df = pd.read_table(...)
df['INDATUMA'] = pd.to_datetime(df['INDATUMA'], coerce=True)
df['UTDATUMA'] = pd.to_datetime(df['UTDATUMA'], coerce=True)

If your dates are in the same format, you may see a very significant speedup by passing the format - there is a fast path specifically for YYYYMMDD formatted dates.

df = pd.read_table(...)
df['INDATUMA'] = pd.to_datetime(df['INDATUMA'], coerce=True, format='%Y%m%d')

Also, note that in next version of pandas (0.17), the coerce argument will be decremented, and you will instead pass errors='coerce'

Upvotes: 2

Related Questions