SlowLearner

Reputation: 7997

Specifying dataframe index as date at time of creation

I'm a newcomer to Python (3.5) from an R background and I'm struggling a little with the differences in the way dataframes are created and used. In particular, I want to create a dataframe using a series of dates for the index. The following experimental code (note the commented-out index argument) works more or less as I expect:

import pandas as pd
import numpy as np

np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
                        'c1': pd.Series(np.random.uniform(10, 20, size=num_periods)),
                        'c2': pd.Series(np.random.uniform(30, 40, size=num_periods))},
                  # index=monthindex,
)
print(dd)

...and gives me this output:

          c1         c2       date
0  11.269698  33.362217 2014-01-01
1  19.667178  34.513765 2014-02-01
2  12.604760  38.402551 2014-03-01
3  18.972365  31.231021 2014-04-01
4  13.767497  35.430262 2014-05-01

...and I can specify the index after creation like this:

dd.index = monthindex
print(dd)

...which gets me this, which looks right:

                   c1         c2       date
2014-01-01  11.269698  33.362217 2014-01-01
2014-02-01  19.667178  34.513765 2014-02-01
2014-03-01  12.604760  38.402551 2014-03-01
2014-04-01  18.972365  31.231021 2014-04-01
2014-05-01  13.767497  35.430262 2014-05-01

But if I uncomment the index argument in the code above, I get the dates in the index but the c1 and c2 columns are filled with NaN values, like this:

            c1  c2       date
2014-01-01 NaN NaN 2014-01-01
2014-02-01 NaN NaN 2014-02-01
2014-03-01 NaN NaN 2014-03-01
2014-04-01 NaN NaN 2014-04-01
2014-05-01 NaN NaN 2014-05-01

I suspect it may be because the two Series objects don't share any index labels with the date index, but I don't really understand what is going on.

What is happening and how should I specify a date index during the creation of the dataframe rather than tacking it on after the call to DataFrame?

Upvotes: 0

Views: 65

Answers (2)

Mike Müller

Reputation: 85512

Using the NumPy arrays directly without creating Series first works:

import pandas as pd
import numpy as np

np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
                        'c1': np.random.uniform(10, 20, size=num_periods),
                        'c2': np.random.uniform(30, 40, size=num_periods)},
                  index=monthindex,
)
print(dd)

Output:

                   c1         c2       date
2014-01-01  11.269698  33.362217 2014-01-01
2014-02-01  19.667178  34.513765 2014-02-01
2014-03-01  12.604760  38.402551 2014-03-01
2014-04-01  18.972365  31.231021 2014-04-01
2014-05-01  13.767497  35.430262 2014-05-01

The Series come with their own default integer indices (0 to 4), which do not match the month index, so aligning them to it leaves every value as NaN. The NumPy arrays don't have an index, so they simply take the index you provide.
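If you do want to keep Series objects in the dict, one option that should also work is to give each Series the same date index when you build it, so that the labels line up during alignment. This is only a sketch reusing the names from the question, not the only way to do it:

```python
import numpy as np
import pandas as pd

np.random.seed(123456)
num_periods = 5
monthindex = pd.date_range('1/1/2014', periods=num_periods, freq='MS')

# Each Series carries monthindex as its own index, so its labels match
# the index passed to DataFrame and no values are lost in alignment.
dd = pd.DataFrame(
    data={
        'date': monthindex,
        'c1': pd.Series(np.random.uniform(10, 20, size=num_periods), index=monthindex),
        'c2': pd.Series(np.random.uniform(30, 40, size=num_periods), index=monthindex),
    },
    index=monthindex,
)
print(dd)
```

The c1 and c2 columns should then hold the random values rather than NaN, because the Series labels and the frame index are identical.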

Upvotes: 1

EdChum

Reputation: 394249

Your error here is that by passing Series as the data you're asking pandas to align each Series' own index with the index you supplied. None of the labels match, so the df is effectively re-indexed and the values are lost. If you pass just the values, it works:

In [61]:
np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
                        'c1': pd.Series(np.random.uniform(10, 20, size=num_periods)).values,
                        'c2': pd.Series(np.random.uniform(30, 40, size=num_periods)).values},
                   index=monthindex,
)
dd

Out[61]:
                   c1         c2       date
2014-01-01  11.269698  33.362217 2014-01-01
2014-02-01  19.667178  34.513765 2014-02-01
2014-03-01  12.604760  38.402551 2014-03-01
2014-04-01  18.972365  31.231021 2014-04-01
2014-05-01  13.767497  35.430262 2014-05-01

Compare monthindex with one of the Series:

In [60]:
monthindex

Out[60]:
DatetimeIndex(['2014-01-01', '2014-02-01', '2014-03-01', '2014-04-01',
               '2014-05-01'],
              dtype='datetime64[ns]', freq='MS')

In [59]:
pd.Series(np.random.uniform(10, 20, size=num_periods))

Out[59]:
0    13.730122
1    14.479968
2    11.294407
3    18.598787
4    18.203884
dtype: float64

You can see that the Series has a default integer index constructed for it, and this is why you get NaN in those columns. If you access the .values attribute to return a np array, you get a flat array with no index:

In [62]:
pd.Series(np.random.uniform(10, 20, size=num_periods)).values

Out[62]:
array([ 13.73012225,  14.47996825,  11.2944068 ,  18.59878707,  18.20388363])

This is by design, by the way: pandas aligns on index labels whenever the data carries them.
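A small sketch of that alignment in isolation, using made-up values rather than the random data above. It assumes a pandas version that has Series.to_numpy() (0.24 or later); on older versions .values plays the same role:

```python
import numpy as np
import pandas as pd

monthindex = pd.date_range('1/1/2014', periods=3, freq='MS')
s = pd.Series([1.0, 2.0, 3.0])  # default integer index: 0, 1, 2

# Passing the Series with a different index aligns on labels.
# No label in s matches a date, so every value becomes NaN.
aligned = pd.DataFrame({'c1': s}, index=monthindex)

# The same thing written explicitly as a reindex.
reindexed = s.reindex(monthindex)

# Dropping the Series index first means values are assigned by position.
from_values = pd.DataFrame({'c1': s.to_numpy()}, index=monthindex)

print(aligned)
print(reindexed)
print(from_values)
```

Only the last frame keeps the original values, because a plain array has no labels to align.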

Upvotes: 1
