Carlos Muñiz

Reputation: 1918

How to automatically detect columns that contain datetime in a pandas dataframe

I'm using pandas' read_sql() function to read multiple SQL tables into DataFrames. To get datetime columns, read_sql() requires a pre-specified list of column names via its parse_dates parameter; it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype object.

   col1    col2
   -----------------------------------
0  A       2017-02-04 10:41:00.0000000
1  B       2017-02-04 10:41:00.0000000
2  C       2017-02-04 10:41:00.0000000
3  D       2017-02-04 10:41:00.0000000
4  E       2017-02-03 06:13:00.0000000
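
The load looks roughly like this (the connection string and table name below are placeholders, not my real ones):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mssql+pyodbc://user:password@my_dsn')  # placeholder

# parse_dates needs the datetime columns listed up front,
# which is exactly what I want to avoid
df = pd.read_sql('SELECT * FROM my_table', engine, parse_dates=['col2'])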

Is there a built-in Pandas function to automatically infer columns which should be datetime64[ns] WITHOUT having to specify the column names?

I've tried:

df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)

which results in an error, because passing a whole DataFrame to pd.to_datetime makes pandas try to assemble datetimes from columns named year, month and day:

to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

I also tried:

pd.to_datetime(df.stack(),  errors='ignore', format='%Y%m%d% H%M%S%f').unstack()

and

pd.to_datetime(df.stack(),  errors='coerce', format='%Y%m%d% H%M%S%f').unstack()

But neither of these works.

Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?

Upvotes: 8

Views: 13461

Answers (4)

Ayan

Reputation: 23

If none of these work out, I have a workaround that sort of works, although it makes an assumption: the datetime columns already present in the dataset have 'year', 'month', 'date' or 'day' in their names. A regex then picks them out.

import re

normzd_column_names = [x.lower() for x in df.columns]

pattern = 'year|month|date|day'
for col in normzd_column_names:
    match = re.search(pattern, col)
    if match:
        print(f'Datetime column found: {col}')
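
To actually convert the matched columns, you could follow up with something like this (a sketch; it assumes pandas is imported as pd and that the matched columns hold parseable strings):

date_cols = [col for col in df.columns if re.search(pattern, col.lower())]
for col in date_cols:
    # errors='coerce' turns values that fail to parse into NaT
    df[col] = pd.to_datetime(df[col], errors='coerce')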

Upvotes: 0

hp_elite

Reputation: 188

Use a regex that allows different separators in dates. The code below accepts yyyy/mm/dd, yyyy-mm-dd, yyyy\mm\dd and yyyy.mm.dd style values (either end may be two or four digits, so dd-mm-yyyy also matches).

The code below will find the datetime columns and convert them:

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime

def format_date(df):
    # candidates: object (string) columns plus columns that are already datetime
    non_numeric_cols = [col for col, col_type in df.dtypes.items()
                        if col_type == 'object' or is_datetime(df[col])]
    if len(non_numeric_cols) > 0:
        # a column qualifies if any of its values looks like a date
        mask = df[non_numeric_cols].astype(str).apply(
            lambda x: x.str.match(r'(\d{2,4}(-|/|\\|\.| )\d{2}(-|/|\\|\.| )\d{2,4})+').any())
        date_cols = mask.index[mask]
        df[date_cols] = df[date_cols].apply(pd.to_datetime, dayfirst=False)
    return df
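
A quick usage sketch (the column names and values here are made up):

df = pd.DataFrame({'id': [1, 2],
                   'created': ['2017/02/04 18:41:00', '2017/02/03 14:13:00']})
df = format_date(df)
print(df.dtypes)  # 'created' should now be datetime64[ns]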

Upvotes: 0

Bharath M Shetty

Reputation: 30605

There is currently no built-in to convert object columns to datetime automatically. One simple way is a list comprehension with a regex pattern for the datetime varchar, i.e.

If you have a df (based on @Alexander's df):

df = pd.DataFrame( {'col1': ['A', 'B', 'C', 'D', 'E'],
  'col2': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00'],
  'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00']})

data = [pd.to_datetime(df[x])
        if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}\:\d{2}\:\d{2}').all()
        else df[x]
        for x in df.columns]

df = pd.concat(data, axis=1, keys=[s.name for s in data])

or, with the help of a mask, i.e.

mask = df.astype(str).apply(lambda x : x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}\:\d{2}\:\d{2}').all())
df.loc[:,mask] = df.loc[:,mask].apply(pd.to_datetime)

df.dtypes

Output:

col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object

If you have mixed date formats, you can use r'(\d{2,4}-\d{2}-\d{2,4})+'. Eg:

ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-03 14:13:00'],
  'col5': ['2017-02-04',
       '2017-02-04',
       '17-02-2004 14:13:00',
       '17-02-2014',
       '2017-02-03']})

mask = ndf.astype(str).apply(lambda x : x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:,mask] = ndf.loc[:,mask].apply(pd.to_datetime)

Output:

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00

Hope it helps

Upvotes: 10

Alexander

Reputation: 109526

You can identify which columns in your dataframe are of type object and then convert only those columns to datetime, using errors='coerce' so that values which cannot be parsed become NaT instead of raising an error. combine_first then lets the timestamps that did convert overwrite the values in your dataframe, falling back to the original values where conversion failed.

df = pd.DataFrame(
     {'col1': ['A', 'B', 'C', 'D', 'E'],
      'col2': ['2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-03 14:13:00'],
      'col3': [0, 1, 2, 3, 4]})

object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']

# parse the object columns; values that cannot be parsed become NaT
converted = df[object_cols].apply(pd.to_datetime, errors='coerce')
# converted timestamps take precedence; where parsing gave NaT,
# fall back to the original values
df[object_cols] = converted.combine_first(df[object_cols])
>>> df
  col1                col2  col3
0    A 2017-02-04 18:41:00     0
1    B 2017-02-04 18:41:00     1
2    C 2017-02-04 18:41:00     2
3    D 2017-02-04 18:41:00     3
4    E 2017-02-03 14:13:00     4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1    5 non-null object
col2    5 non-null datetime64[ns]
col3    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes

Upvotes: 7
