Reputation: 7997
I'm a newcomer to Python (3.5) from an R background and I'm struggling a little with the differences in the way dataframes are created and used. In particular, I want to create a dataframe using a series of dates for the index. The following experimental code (note the commented-out index argument) works more or less as I expect:
import pandas as pd
import numpy as np
np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
'c1': pd.Series(np.random.uniform(10, 20, size=num_periods)),
'c2': pd.Series(np.random.uniform(30, 40, size=num_periods))},
# index=monthindex,
)
print(dd)
...and gives me this output:
c1 c2 date
0 11.269698 33.362217 2014-01-01
1 19.667178 34.513765 2014-02-01
2 12.604760 38.402551 2014-03-01
3 18.972365 31.231021 2014-04-01
4 13.767497 35.430262 2014-05-01
...and I can specify the index after creation like this:
dd.index = monthindex
print(dd)
...which gets me this, which looks right:
c1 c2 date
2014-01-01 11.269698 33.362217 2014-01-01
2014-02-01 19.667178 34.513765 2014-02-01
2014-03-01 12.604760 38.402551 2014-03-01
2014-04-01 18.972365 31.231021 2014-04-01
2014-05-01 13.767497 35.430262 2014-05-01
But if I uncomment the index argument in the code above, I get the dates in the index but I am left with NaN values like this:
c1 c2 date
2014-01-01 NaN NaN 2014-01-01
2014-02-01 NaN NaN 2014-02-01
2014-03-01 NaN NaN 2014-03-01
2014-04-01 NaN NaN 2014-04-01
2014-05-01 NaN NaN 2014-05-01
I suspect it may be because the two Series objects don't share any values with the index, but I don't really understand what is going on.
What is happening, and how should I specify a date index during the creation of the dataframe rather than tacking it on after the call to DataFrame?
Upvotes: 0
Views: 65
Reputation: 85512
Using the NumPy arrays directly without creating Series first works:
import pandas as pd
import numpy as np
np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
'c1': np.random.uniform(10, 20, size=num_periods),
'c2': np.random.uniform(30, 40, size=num_periods)},
index=monthindex,
)
print(dd)
Output:
c1 c2 date
2014-01-01 11.269698 33.362217 2014-01-01
2014-02-01 19.667178 34.513765 2014-02-01
2014-03-01 12.604760 38.402551 2014-03-01
2014-04-01 18.972365 31.231021 2014-04-01
2014-05-01 13.767497 35.430262 2014-05-01
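Each Series comes with its own default integer index (0 to 4), which does not match the month index, so pandas finds no matching labels and fills those columns with NaN. The NumPy arrays don't have an index, so they simply take on the index you provide.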
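If you would rather keep the Series, another option is to give each one the same date index when it is built, so that the labels line up. The following is a minimal sketch reusing the names from the question; passing index=monthindex to each pd.Series call is my addition, not part of the original code:

```python
import numpy as np
import pandas as pd

np.random.seed(123456)
num_periods = 5
monthindex = pd.date_range('1/1/2014', periods=num_periods, freq='MS')

# Each Series is built with the date index, so its labels match the
# index passed to DataFrame and alignment finds every value.
dd = pd.DataFrame(
    data={
        'date': monthindex,
        'c1': pd.Series(np.random.uniform(10, 20, size=num_periods), index=monthindex),
        'c2': pd.Series(np.random.uniform(30, 40, size=num_periods), index=monthindex),
    },
    index=monthindex,
)
print(dd)
```

Here the index argument to DataFrame is redundant, since the Series already carry the dates, but it is harmless and keeps the call close to the original.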
Upvotes: 1
Reputation: 394249
The problem here is that when you pass Series as the data, pandas aligns each Series on its own index against the index you supply. The Series have a default integer index, none of whose labels match the dates, so every value comes out as NaN. If you pass just the values, it works:
In [61]:
np.random.seed(123456)
num_periods=5
monthindex=pd.date_range('1/1/2014', periods=num_periods, freq='MS')
dd = pd.DataFrame(data={'date':monthindex,
'c1': pd.Series(np.random.uniform(10, 20, size=num_periods)).values,
'c2': pd.Series(np.random.uniform(30, 40, size=num_periods)).values},
index=monthindex,
)
dd
Out[61]:
c1 c2 date
2014-01-01 11.269698 33.362217 2014-01-01
2014-02-01 19.667178 34.513765 2014-02-01
2014-03-01 12.604760 38.402551 2014-03-01
2014-04-01 18.972365 31.231021 2014-04-01
2014-05-01 13.767497 35.430262 2014-05-01
If you compare monthindex with the index of one of the Series:
In [60]:
monthindex
Out[60]:
DatetimeIndex(['2014-01-01', '2014-02-01', '2014-03-01', '2014-04-01',
'2014-05-01'],
dtype='datetime64[ns]', freq='MS')
In [59]:
pd.Series(np.random.uniform(10, 20, size=num_periods))
Out[59]:
0 13.730122
1 14.479968
2 11.294407
3 18.598787
4 18.203884
dtype: float64
You can see that the Series has a default index constructed (0 to 4). This is why you get NaN in those columns, whereas if you access the .values attribute to get a NumPy array, you get a flat array with no index:
In [62]:
pd.Series(np.random.uniform(10, 20, size=num_periods)).values
Out[62]:
array([ 13.73012225, 14.47996825, 11.2944068 , 18.59878707, 18.20388363])
This, by the way, is by design: pandas aligns data on index labels, not on position.
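To see the alignment in isolation, here is a small sketch with a hypothetical three-value Series s (not from the question). It assumes a pandas version that has Series.to_numpy(), which is the newer spelling of .values:

```python
import numpy as np
import pandas as pd

monthindex = pd.date_range('1/1/2014', periods=3, freq='MS')
s = pd.Series([1.0, 2.0, 3.0])  # default integer index 0, 1, 2

# Reindexing to labels the Series does not have yields NaN for each one,
# which is in effect what the DataFrame constructor does with index=...
aligned = s.reindex(monthindex)
print(aligned)

# Dropping the index first makes pandas assign the values by position.
by_position = pd.DataFrame({'c1': s.to_numpy()}, index=monthindex)
print(by_position)
```

The first print shows three NaN values against the dates; the second shows the original values attached to the dates in order.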
Upvotes: 1