Reputation: 1918
I'm using Pandas' read_sql() function to read multiple SQL tables into DataFrames. To have columns read as datetime objects, this function requires a pre-specified list of column names via the 'parse_dates' parameter; it does not infer datetimes automatically from varchar columns on the server (a sketch of such a call follows the sample output below). Because of this, I get DataFrames in which all columns are of dtype object.
  col1                          col2
-----------------------------------
0    A  2017-02-04 10:41:00.0000000
1    B  2017-02-04 10:41:00.0000000
2    C  2017-02-04 10:41:00.0000000
3    D  2017-02-04 10:41:00.0000000
4    E  2017-02-03 06:13:00.0000000
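For reference, this is roughly what the call looks like when the datetime columns are listed explicitly ('conn' and 'my_table' are placeholders):
import pandas as pd

# 'conn' is a placeholder DB connection and 'my_table' a placeholder table;
# parse_dates is the documented way to request datetime parsing up front.
df = pd.read_sql('SELECT * FROM my_table', conn, parse_dates=['col2'])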
Is there a built-in Pandas function to automatically infer columns which should be datetime64[ns] WITHOUT having to specify the column names?
I've tried:
df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)
which results in an error:
to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
I also tried:
pd.to_datetime(df.stack(), errors='ignore', format='%Y%m%d% H%M%S%f').unstack()
and
pd.to_datetime(df.stack(), errors='coerce', format='%Y%m%d% H%M%S%f').unstack()
But neither of these works.
Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?
Upvotes: 8
Views: 13461
Reputation: 23
If none of the above works out, I have a workaround which sort of works, although it rests on certain assumptions. Here I assume that any datetime columns already present in the dataset have 'year', 'month', 'date' or 'day' in their name, and use a regex to pick them all out.
import re

normzd_column_names = [x.lower() for x in df.columns]
pattern = 'year|month|date|day'
for col in normzd_column_names:
    match = re.search(pattern, col)
    if match:
        print(f'Datetime column found: {col}')
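To actually convert the matched columns rather than just print them, one could follow up on the loop above along these lines (this extension is mine, not part of the original workaround; it assumes pandas is imported as pd):
for col in df.columns:
    if re.search(pattern, col.lower()):
        # convert in place; values that cannot be parsed become NaT
        df[col] = pd.to_datetime(df[col], errors='coerce')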
Upvotes: 0
Reputation: 188
Use a regex that allows different date separators - the pattern below matches values like yyyy/mm/dd, yyyy-mm-dd, yyyy\mm\dd and yyyy.mm.dd (with the year allowed at either end).
The code below will find and convert the datetime columns:
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime

def format_date(df):
    # candidate columns: object dtype or already datetime
    non_numeric_cols = [col for col, col_type in df.dtypes.items()
                        if col_type == 'object' or is_datetime(df[col])]
    if len(non_numeric_cols) > 0:
        # columns where any value looks like a date with -, /, \, . or space separators
        mask = df.astype(str).apply(
            lambda x: x.str.match(r'(\d{2,4}(-|\/|\\|\.| )\d{2}(-|\/|\\|\.| )\d{2,4})+').any())
        df.loc[:, mask] = df.loc[:, mask].apply(pd.to_datetime, dayfirst=False)
    return df
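For example, applied to a small frame (column names here are made up; the exact dtype behaviour of the .loc assignment can vary across pandas versions):
sample = pd.DataFrame({'id': [1, 2],
                       'created': ['2017/02/04', '2017/02/03']})
sample = format_date(sample)
print(sample.dtypes)  # 'created' should now hold parsed datetimes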
Upvotes: 0
Reputation: 30605
There is currently no builtin to convert object columns to datetime automatically. One simple way is based on a list comprehension and a regex matching the datetime varchar pattern, i.e.
If you have a df (based on @Alexander's df)
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                   'col2': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00'],
                   'col3': [0, 1, 2, 3, 4],
                   'col4': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00']})
data = [pd.to_datetime(df[x]) if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all()
        else df[x] for x in df.columns]
df = pd.concat(data, axis=1, keys=[s.name for s in data])
or with the help of a mask, i.e.
mask = df.astype(str).apply(lambda x : x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}\:\d{2}\:\d{2}').all())
df.loc[:,mask] = df.loc[:,mask].apply(pd.to_datetime)
df.dtypes
Output:
col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object
If you have mixed date formats then you can use r'(\d{2,4}-\d{2}-\d{2,4})+'
Eg:
ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
                    'col4': ['2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-03 14:13:00'],
                    'col5': ['2017-02-04',
                             '2017-02-04',
                             '17-02-2004 14:13:00',
                             '17-02-2014',
                             '2017-02-03']})
mask = ndf.astype(str).apply(lambda x : x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:,mask] = ndf.loc[:,mask].apply(pd.to_datetime)
Output:
   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00
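A quick dtypes check (this is what one would expect given the output above, on the pandas versions of this era):
print(ndf.dtypes)
# col3             int64
# col4    datetime64[ns]
# col5    datetime64[ns]
# dtype: object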
Hope it helps
Upvotes: 10
Reputation: 109526
You can identify which columns in your dataframe are of type object
and then convert only those columns to datetime, coercing
so that values which cannot be converted become NaT instead of raising errors. Use combine_first
to overwrite the values in your dataframe with the values that did convert to timestamps.
df = pd.DataFrame(
    {'col1': ['A', 'B', 'C', 'D', 'E'],
     'col2': ['2017-02-04 18:41:00',
              '2017-02-04 18:41:00',
              '2017-02-04 18:41:00',
              '2017-02-04 18:41:00',
              '2017-02-03 14:13:00'],
     'col3': [0, 1, 2, 3, 4]})
object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']
df.loc[:, object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
col1 col2 col3
0 A 2017-02-04 18:41:00 0
1 B 2017-02-04 18:41:00 1
2 C 2017-02-04 18:41:00 2
3 D 2017-02-04 18:41:00 3
4 E 2017-02-03 14:13:00 4
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1 5 non-null object
col2 5 non-null datetime64[ns]
col3 5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes
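A note on the combine_first direction: the caller's non-null values win and its nulls are filled from the argument, so the coerced result must be the caller and the original values then plug the NaT gaps. A minimal sketch of that behaviour:
import pandas as pd

s = pd.Series(['A', '2017-02-04 18:41:00'])
converted = pd.to_datetime(s, errors='coerce')  # 'A' becomes NaT
# keep successful conversions, fall back to the original where parsing failed
print(converted.combine_first(s))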
Upvotes: 7