Reputation: 4144
My data has two date fields (in days), but among the tens of millions of lines there are errors in some rows. For speed and memory reasons I would rather convert the columns to datetime right upon reading in the data, yet I cannot seem to coerce the process to just leave NaT
for the invalid strings instead of raising an error. I am able to read the fields in as uint32
and coerce them to datetime later, but that seems unnecessarily slow given that we have the parse_dates
option.
The sample data follows at the bottom of the post.
This approach works beautifully as long as the data is always well formatted:
pd.read_table(filename, usecols=[0,8,9,11], parse_dates=[1,2])
However, if some rows cannot be converted, this results in dtype object, and later operations then break.
My next attempt,
pd.read_table(filename, usecols=[0,8,9,11], parse_dates=[1,2], dtype={'LopNr': np.uint32, 'INDATUMA': np.uint32, 'UTDATUMA': np.uint32, 'DIAGNOS': np.object})
made on the assumption that the dtype would be applied before the data enters the converter, hiccups on a string in some of the rows:
ValueError: invalid literal for long() with base 10: 'string'
Nor does
pd.read_table(filename, usecols=[0,8,9,11], parse_dates=[1,2], dtype={'LopNr': np.uint32, 'INDATUMA': 'datetime64', 'UTDATUMA': 'datetime64', 'DIAGNOS': np.object})
work, as it raises:
TypeError: the dtype <M8 is not supported for parsing, pass this column using parse_dates instead
What is an efficient way to proceed with this kind of data if I am willing to discard the rows whose strings are not parseable as dates?
The sample data (with all meaningful digits replaced with . for confidentiality) follows; note that two dates are misformatted in the last two rows:
LopNr AR KON ALDER LKF SJUKHUS MVO LT_KLIN INDATUMA UTDATUMA HDIA DIAGNOS OP PVARD INSATT UTSATT VTID EKOD. EKOD. ICD PEKARE OPD. OPD. OPD. OPD. OPD. OPD. OPD. OPD. OPD. OPD.. OPD.. OPD..
.. .... . . ...... ..... ... ... 19970320 19970320 S... S... . . . . W.... ..
.. .... . . ...... ..... ... ... 19970306 19970307 S... S... . . . . W.... ..
.. .... . .. ...... ..... ... ... 19961219 19970208 Z... Z... S... . . . .. W.... ..
.. .... . .. ...... ..... ... ... 19970208 19970320 Z... Z... S... . . . .. W.... ..
.. .... . .. ...... ..... ... ... 19970604 19970611 I... I... I... I... . . . . ..
.. .... . .. ...... ..... ... ... 19970402 19970406 O800A O800A . . . . ..
.. .... . .. ...... ..... ... ... 19970412 19970415 R... R... I... Z... J... . . . . ..
.. .... . .. ...... ..... ... ... 19970520 19970523 R... R... I... J... V.... . . . . .. .
.. .... . .. ...... ..... ... ... 19970504 19970507 I... I... . . . . ..
.. .... . .. ...... ..... ... ... 1997050 19970507 I... I... . . . . ..
.. .... . .. ...... ..... ... ... 19970504 string I... I... . . . . ..
Upvotes: 1
Views: 5928
Reputation: 52276
You could do this:
parser = lambda x: pd.to_datetime(x, coerce=True)
pd.read_table(..., parse_dates=[0,1], date_parser=parser)
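Put together, a minimal self-contained sketch of this route (the file name sample.txt is a stand-in for your data, and usecols/parse_dates follow the call in the question):
import pandas as pd

# Any value to_datetime cannot parse comes back as NaT instead of raising
parser = lambda x: pd.to_datetime(x, coerce=True)

df = pd.read_table('sample.txt', usecols=[0, 8, 9, 11],
                   parse_dates=[1, 2], date_parser=parser)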
But this isn't any faster than just reading in and then parsing. Even in the default case, parsing happens after the data is read in.
df = pd.read_table(...)
df['INDATUMA'] = pd.to_datetime(df['INDATUMA'], coerce=True)
df['UTDATUMA'] = pd.to_datetime(df['UTDATUMA'], coerce=True)
If your dates are all in the same format, you may see a very significant speedup by passing the format explicitly - there is a fast path specifically for YYYYMMDD-formatted dates.
df = pd.read_table(...)
df['INDATUMA'] = pd.to_datetime(df['INDATUMA'], coerce=True, format='%Y%m%d')
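Applied to the sample data above (a sketch, assuming df was read in as shown), the two misformatted values - '1997050' and 'string' - do not match the format and simply come back as NaT:
df['UTDATUMA'] = pd.to_datetime(df['UTDATUMA'], coerce=True, format='%Y%m%d')

# Inspect the rows that failed to parse in either column
df[df['INDATUMA'].isnull() | df['UTDATUMA'].isnull()]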
Also, note that in the next version of pandas
(0.17), the coerce
argument will be deprecated, and you will instead pass errors='coerce'
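Under 0.17 and later, the same conversion would therefore be spelled:
df['INDATUMA'] = pd.to_datetime(df['INDATUMA'], errors='coerce', format='%Y%m%d')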
Upvotes: 2