SynackSA
SynackSA

Reputation: 889

Complex Condition in vectorization of Pandas

I'm fairly new to Python and Pandas, so I'm still learning. I have a Dataframe that contains a bunch of OHLCV data, which is loaded into a Dataframe

                                          Timestamp       Open       High     Low   Close      Volume  Trades
Timestamp                                                                                                    
2015-08-07 14:03:00+00:00 2015-08-07 14:03:00+00:00    3.00000    3.00000    3.00    3.00   81.857278       2
2015-08-07 17:19:00+00:00 2015-08-07 17:19:00+00:00    3.00001    3.00001    3.00    3.00   42.073291       2
2015-08-08 06:43:00+00:00 2015-08-08 06:43:00+00:00    3.00000    3.00000    3.00    3.00    0.400000       1
2015-08-08 09:31:00+00:00 2015-08-08 09:31:00+00:00    2.00000    2.00000    2.00    2.00  125.000000       2
2015-08-08 16:30:00+00:00 2015-08-08 16:30:00+00:00    1.20000    1.20000    1.20    1.20   54.759700       1
...                                             ...        ...        ...     ...     ...         ...     ...
2020-12-31 23:55:00+00:00 2020-12-31 23:55:00+00:00  738.49000  738.49000  738.49  738.49    0.748789       3
2020-12-31 23:56:00+00:00 2020-12-31 23:56:00+00:00  738.07000  738.07000  737.72  737.72    2.491733       8
2020-12-31 23:57:00+00:00 2020-12-31 23:57:00+00:00  738.15000  738.15000  737.94  738.05   56.043875       9
2020-12-31 23:58:00+00:00 2020-12-31 23:58:00+00:00  738.14000  738.15000  737.55  737.75   80.826279      16
2020-12-31 23:59:00+00:00 2020-12-31 23:59:00+00:00  737.01000  737.60000  737.01  737.45    3.129885       8

As you can see, there are missing minutes between some of the entries, due to no actual volume happening in that minute. What I'm trying to do, is create a new column that has the previous minutes volume, if there is no row for the previous minute, then that value should be 0.

I was trying to use the following statement:

df.loc[
    (df['Timestamp'].shift().notna() == True) & 
    (
        ((df['Timestamp'].shift().dt.minute.astype(int) == (df['Timestamp'].dt.minute.astype(int) - 1)) & (df['Timestamp'].dt.minute.astype(int) >= 2)) |
        ((df['Timestamp'].shift().dt.minute.astype(int) == 59) & (df['Timestamp'].dt.minute.astype(int) >= 2)) 
    ), 'previousVolume'] = df['Volume'].shift().astype(float)

For adding 0s I would check for the negative of the above.

The error that I get is the following:

Traceback (most recent call last):
  File "engulfing_test.py", line 145, in <module>
    ), 'previousVolume'] = df['Volume'].shift().astype(float)
  File "/Users/username/Projects/Personal/crypto-bot/env/lib/python3.7/site-packages/pandas/core/generic.py", line 5877, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/Users/username/Projects/Personal/crypto-bot/env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/Users/username/Projects/Personal/crypto-bot/env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/username/Projects/Personal/crypto-bot/env/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/Users/username/Projects/Personal/crypto-bot/env/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1068, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

I know complex statements like this work, cause doing something like this works:

df.loc[(df['Timestamp'].dt.minute.astype(int) > 10) & (df['Timestamp'].dt.minute.astype(int) < 57), 'gt10lt57'] = 'Yes'

Any help greatly appreciated

EDIT

Adding output of print(df.head().to_dict())

{'Timestamp': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): Timestamp('2015-08-07 14:03:00+0000', tz='UTC'), Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): Timestamp('2015-08-07 17:19:00+0000', tz='UTC'), Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): Timestamp('2015-08-08 06:43:00+0000', tz='UTC'), Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): Timestamp('2015-08-08 09:31:00+0000', tz='UTC'), Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): Timestamp('2015-08-08 16:30:00+0000', tz='UTC')}, 'Open': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 3.00001, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 2.0, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 1.2}, 'High': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 3.00001, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 2.0, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 1.2}, 'Low': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 2.0, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 1.2}, 'Close': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 3.0, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 2.0, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 1.2}, 'Volume': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 81.85727776, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 42.07329055, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 0.4, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 125.0, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 54.7597}, 'Trades': {Timestamp('2015-08-07 14:03:00+0000', tz='UTC'): 2, Timestamp('2015-08-07 17:19:00+0000', tz='UTC'): 2, Timestamp('2015-08-08 06:43:00+0000', tz='UTC'): 1, Timestamp('2015-08-08 09:31:00+0000', tz='UTC'): 2, Timestamp('2015-08-08 16:30:00+0000', tz='UTC'): 1}}

Upvotes: 2

Views: 133

Answers (1)

piRSquared
piRSquared

Reputation: 294228

asfreq

First, get rid of the duplicate Timestamp column that is identical to the index.
Leave the index where it is. asfreq will change the frequency of the index.

df.drop('Timestamp', axis=1).asfreq('min', fill_value=0)

                             Open    High     Low   Close     Volume  Trades
Timestamp                                                                   
2015-08-07 14:03:00+00:00    3.00    3.00    3.00    3.00  81.857278       2
2015-08-07 14:04:00+00:00    0.00    0.00    0.00    0.00   0.000000       0
2015-08-07 14:05:00+00:00    0.00    0.00    0.00    0.00   0.000000       0
2015-08-07 14:06:00+00:00    0.00    0.00    0.00    0.00   0.000000       0
2015-08-07 14:07:00+00:00    0.00    0.00    0.00    0.00   0.000000       0
...                           ...     ...     ...     ...        ...     ...
2020-12-31 23:55:00+00:00  738.49  738.49  738.49  738.49   0.748789       3
2020-12-31 23:56:00+00:00  738.07  738.07  737.72  737.72   2.491733       8
2020-12-31 23:57:00+00:00  738.15  738.15  737.94  738.05  56.043875       9
2020-12-31 23:58:00+00:00  738.14  738.15  737.55  737.75  80.826279      16
2020-12-31 23:59:00+00:00  737.01  737.60  737.01  737.45   3.129885       8

[2841717 rows x 6 columns]

Upvotes: 2

Related Questions