Pandas - Strange issue with datetime column

Question

I have a strange issue with a datetime column. Suppose there are dates in a start_date column:

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 641 entries, 9 to 1394
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   number      641 non-null    object        
 1   start_date  641 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 15.0+ KB

When I set the index to start_date, the DatetimeIndex seems to be incomplete:

>>> df2 = df2.set_index('start_date')
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 641 entries, 2020-01-01 to 2020-03-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   number  641 non-null    object
dtypes: object(1)
memory usage: 10.0+ KB

Actually there are more entries in this dataframe:

df3 = df2.copy()
df3 = df3.reset_index()
df3 = df3[pd.to_datetime(df3['start_date']).dt.month > 3]
df3 = df3.set_index('start_date')
df3.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 393 entries, 2020-04-01 to 2020-09-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   number  393 non-null    object
dtypes: object(1)
memory usage: 6.1+ KB

As you can see, there are entries with dates up to 2020-09-01. So why does the summary show these dates only some of the time? I could not find a gap or anything similar in the start_date index.

ALollz · Accepted Answer

When DataFrame.info prints the "XXX to YYY" summary for the index, it simply prints the first index value and the last index value. If your index is not sorted (you can check with df.index.is_monotonic_increasing; the older df.index.is_monotonic alias was deprecated and then removed in pandas 2.0), this doesn't correspond to the full span.
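As a minimal sketch of that check, the snippet below builds a small out-of-order index and compares its first and last values. The dates are made up for illustration, and it assumes a pandas version that provides is_monotonic_increasing.

```python
import pandas as pd

# Hypothetical index: the dates are deliberately out of order.
idx = pd.to_datetime(["2020-01-01", "2020-09-01", "2020-03-01"])

# False here, because 2020-03-01 comes after 2020-09-01 in the index.
print(idx.is_monotonic_increasing)

# The first and last positions are what the info() summary reports.
print(f"{idx[0].date()} to {idx[-1].date()}")  # 2020-01-01 to 2020-03-01
```

The September date sits in the middle of the index, so it never shows up in a first-to-last summary.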

The code responsible for this is Index._summary, and it's clear that it simply looks at the first value [0] and the last value [-1] when it summarizes:

def _summary(self, name=None) -> str_t:
    """
    Return a summarized representation.
    Parameters
    ----------
    name : str
        name to use in the summary representation
    Returns
    -------
    String with a summarized representation of the index
    """
    if len(self) > 0:
        head = self[0]
        if hasattr(head, "format") and not isinstance(head, str):
            head = head.format()
        tail = self[-1]
        if hasattr(tail, "format") and not isinstance(tail, str):
            tail = tail.format()
        index_summary = f", {head} to {tail}"
    else:
        index_summary = ""

Here's a simple example:

import pandas as pd

df = pd.DataFrame(data=[1,1,1], index=pd.to_datetime(['2010-01-01', '2012-01-01', '2011-01-01']))

df.info()

#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 3 entries, 2010-01-01 to 2011-01-01
#...

If you want the full range, sort the index before looking at the info:

df.sort_index().info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 3 entries, 2010-01-01 to 2012-01-01
#...
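If you only need the span and not the full info output, one alternative (not part of the original answer, just a sketch) is to ask the index for its minimum and maximum directly. This works regardless of row order and avoids sorting the frame:

```python
import pandas as pd

# Same out-of-order example frame as above.
df = pd.DataFrame(
    data=[1, 1, 1],
    index=pd.to_datetime(["2010-01-01", "2012-01-01", "2011-01-01"]),
)

# min()/max() look at every value, so the order of the rows does not matter.
earliest = df.index.min()
latest = df.index.max()

print(f"{earliest.date()} to {latest.date()}")  # 2010-01-01 to 2012-01-01
```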
