Reputation: 3046
I have been searching if I could create dummies using the date
indexed in pandas
, but could not find anything yet.
I have a df
that is indexed by date
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
...
I know I could setdate
as a column using,
df.reset_index(level=0, inplace=True)
and then use something like this to create dummies,
df['main_hours'] = np.where((df['date'] >= '2010-01-02 03:00:00') & (df['date'] <= '2010-01-02 05:00:00')1,0)
However, I would like to create dummy variables using indexed date
on the fly without using date
as a column. Is there a way in pandas
like that?
Any suggestion would be appreciated.
Upvotes: 2
Views: 2076
Reputation: 109696
df = df.assign(main_hours=0)
df.loc[df.between_time(start_time='3:00', end_time='5:00').index, 'main_hours'] = 1
>>> df
dew temp main_hours
2010-01-02 00:00:00 129 -16 0
2010-01-02 01:00:00 148 -15 0
2010-01-02 02:00:00 159 -11 0
2010-01-02 03:00:00 181 -7 1
2010-01-02 04:00:00 138 -7 1
Upvotes: 1
Reputation: 210952
IIUC:
df['main_hours'] = \
np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'),
1,
0)
or:
In [8]: df['main_hours'] = \
((df.index >= '2010-01-02 03:00:00') &
(df.index <= '2010-01-02 05:00:00')).astype(int)
In [9]: df
Out[9]:
dew temp main_hours
date
2010-01-02 00:00:00 129.0 -16 0
2010-01-02 01:00:00 148.0 -15 0
2010-01-02 02:00:00 159.0 -11 0
2010-01-02 03:00:00 181.0 -7 1
2010-01-02 04:00:00 138.0 -7 1
Timing: for 50.000 rows DF:
In [19]: df = pd.concat([df.reset_index()] * 10**4, ignore_index=True).set_index('date')
In [20]: pd.options.display.max_rows = 10
In [21]: df
Out[21]:
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
... ... ...
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
[50000 rows x 2 columns]
In [22]: %timeit ((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00')).astype(int)
1.58 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: %timeit np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'), 1, 0)
1.52 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: df.shape
Out[24]: (50000, 2)
Upvotes: 2
Reputation: 323366
Or using between
;
pd.Series(df.index).between('2010-01-02 03:00:00', '2010-01-02 05:00:00', inclusive=True).astype(int)
Out[1567]:
0 0
1 0
2 0
3 1
4 1
Name: date, dtype: int32
Upvotes: 2