Reputation: 93
I have a strange issue with a datetime column. Suppose a DataFrame has dates in a start_date column:
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 641 entries, 9 to 1394
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 641 non-null object
1 start_date 641 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 15.0+ KB
When I set the index to start_date, the DatetimeIndex seems to be incomplete:
>>> df2 = df2.set_index('start_date')
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 641 entries, 2020-01-01 to 2020-03-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 641 non-null object
dtypes: object(1)
memory usage: 10.0+ KB
However, the DataFrame also contains later dates:
df3 = df2.copy()
df3 = df3.reset_index()
df3 = df3[pd.to_datetime(df3['start_date']).dt.month > 3]
df3 = df3.set_index('start_date')
df3.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 393 entries, 2020-04-01 to 2020-09-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 393 non-null object
dtypes: object(1)
memory usage: 6.1+ KB
As you can see, there are entries with dates up to 2020-09-01. So why does info() report only part of the date range? I could not detect a gap or anything similar in the start_date index.
Upvotes: 1
Views: 221
Reputation: 59519
When DataFrame.info prints the "XXX to YYY" information for the index, it simply prints the first index value and the last index value. If your index is not monotonic (easy to check with df.index.is_monotonic_increasing; the older df.index.is_monotonic alias was removed in pandas 2.0), this does not correspond to the full span.
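As a quick illustration of that check, here is a minimal sketch. The index below is made up for the example (it is not the asker's data), and it uses is_monotonic_increasing, the attribute name available in current pandas:

```python
import pandas as pd

# Hypothetical, deliberately unsorted index for illustration only
idx = pd.to_datetime(["2020-01-01", "2020-09-01", "2020-03-01"])

# False means the first/last values need not be the earliest/latest dates
is_sorted = idx.is_monotonic_increasing

first_last = (idx[0], idx[-1])      # the two values the info() summary uses
full_span = (idx.min(), idx.max())  # the actual earliest and latest dates

print(is_sorted)
print(first_last)
print(full_span)
```

Here the last element is 2020-03-01 even though the index contains 2020-09-01, which mirrors the situation in the question.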
The code responsible for this is Index._summary, and it's clear that it simply looks at the first value [0] and the last value [-1] when it summarizes:
def _summary(self, name=None) -> str_t:
    """
    Return a summarized representation.

    Parameters
    ----------
    name : str
        name to use in the summary representation

    Returns
    -------
    String with a summarized representation of the index
    """
    if len(self) > 0:
        head = self[0]
        if hasattr(head, "format") and not isinstance(head, str):
            head = head.format()
        tail = self[-1]
        if hasattr(tail, "format") and not isinstance(tail, str):
            tail = tail.format()
        index_summary = f", {head} to {tail}"
    else:
        index_summary = ""
Here's a simple example:
import pandas as pd
df = pd.DataFrame(data=[1,1,1], index=pd.to_datetime(['2010-01-01', '2012-01-01', '2011-01-01']))
df.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 3 entries, 2010-01-01 to 2011-01-01
#...
If you want the full range, sort the index before looking at the info:
df.sort_index().info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 3 entries, 2010-01-01 to 2012-01-01
#...
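If only the date span is of interest, sorting is not strictly necessary. A minimal sketch, reusing the three-row example frame from above, that reads the span directly from the index with min() and max():

```python
import pandas as pd

df = pd.DataFrame(
    data=[1, 1, 1],
    index=pd.to_datetime(["2010-01-01", "2012-01-01", "2011-01-01"]),
)

# Earliest and latest dates, independent of the row order
span = (df.index.min(), df.index.max())
print(f"{span[0].date()} to {span[1].date()}")
# 2010-01-01 to 2012-01-01
```

This leaves the row order of the DataFrame untouched, which may matter if the existing order is meaningful.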
Upvotes: 1