Reputation: 615
I'm trying to build a DataFrame of stock data. I can get all the data I need, but only 1000 data points at a time. So what I want to do is save the initial 1000 data points to a CSV file, then rerun my program periodically and append any new data to the old DataFrame. So it needs to check the 'new df' against the 'old df' and append any new rows.
Let's say my 'old df', saved in a CSV file, looks like this:
Date      Stock A  Stock B  Stock C  Stock D
01/02/19  100.0    87.0     74.0     228.0
02/02/19  101.5    87.5     75.0     227.0
03/02/19  102.0    89.0     76.5     225.5
And I then run my program the next day, and the 'new df' looks like this:
Date      Stock A  Stock B  Stock C  Stock D
02/02/19  101.5    87.5     75.0     227.0
03/02/19  102.0    89.0     76.5     225.5
04/02/19  103.0    89.5     77.5     226.0
What I then need is for my program to identify that the last row in the 'new df' isn't in the 'old df', and to append that recent data to the 'old df', in this case:

04/02/19  103.0    89.5     77.5     226.0
Resulting in the following df, which would then be saved as the 'old df' so I can repeat the process the next day:

Date      Stock A  Stock B  Stock C  Stock D
01/02/19  100.0    87.0     74.0     228.0
02/02/19  101.5    87.5     75.0     227.0
03/02/19  102.0    89.0     76.5     225.5
04/02/19  103.0    89.5     77.5     226.0
I imagine the code will have to use some variation of:
old_df.append(new_df)
But with something in there to scan for data that is already in old_df, along the lines of the sketch below.
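Something like this, maybe (an untested sketch on the toy data above, trimmed to two stock columns, with 'Date' as a regular column rather than the index):

import pandas as pd

# Toy frames matching the tables above, with 'Date' as a regular column
old_df = pd.DataFrame({'Date': ['01/02/19', '02/02/19', '03/02/19'],
                       'Stock A': [100.0, 101.5, 102.0],
                       'Stock B': [87.0, 87.5, 89.0]})
new_df = pd.DataFrame({'Date': ['02/02/19', '03/02/19', '04/02/19'],
                       'Stock A': [101.5, 102.0, 103.0],
                       'Stock B': [87.5, 89.0, 89.5]})

# Stack both frames and keep one row per Date; keep='last' means the row
# coming from new_df wins whenever the same date appears in both frames
combined = (pd.concat([old_df, new_df])
              .drop_duplicates('Date', keep='last')
              .reset_index(drop=True))
print(combined)   # four rows, 01/02/19 through 04/02/19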
Any help would be much appreciated.
This is my code so far:
import requests
import json
import pandas as pd
import datetime as dt

#total_data = pd.read_csv('1m_bin_db.csv')
#total_data.set_index('Date', inplace=True)

def get_bars(symbol, interval):
    # Fetch up to 1000 candles for one symbol from the Binance klines endpoint
    url = 'https://api.binance.com/api/v1/klines?symbol=' + symbol + '&interval=' + interval + '&limit=1000'
    data = json.loads(requests.get(url).text)
    df = pd.DataFrame(data)
    df.columns = ['open_time',
                  'o', 'h', 'l', 'c', 'v',
                  'close_time', 'qav', 'num_trades',
                  'taker_base_vol', 'taker_quote_vol', 'ignore']
    # Index each candle by its close time (epoch milliseconds -> datetime)
    df.index = [dt.datetime.fromtimestamp(x / 1000.0) for x in df.close_time]
    return df

coins = ['ADABTC']
dfs = []
for coin in coins:
    get_data = get_bars(coin, '1m')
    df = get_data[['o', 'h', 'l', 'c', 'v']].add_prefix(coin + '_')
    df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
    dfs.append(df)

prices_1m = pd.concat(dfs, axis=1)
prices_1m.index.name = 'Date'
When I print total_data from the CSV I get:
ADABTC_o ADABTC_h ... ADABTC_c ADABTC_v
Date ...
2019-02-15 12:41:59.999 0.000011 0.000011 ... 0.000011 48805.0
2019-02-15 12:42:59.999 0.000011 0.000011 ... 0.000011 837.0
2019-02-15 12:43:59.999 0.000011 0.000011 ... 0.000011 19430.0
2019-02-15 12:44:59.999 0.000011 0.000011 ... 0.000011 15319.0
2019-02-15 12:45:59.999 0.000011 0.000011 ... 0.000011 769414.0
When I print prices_1m I get:
ADABTC_o ADABTC_h ... ADABTC_c ADABTC_v
Date ...
2019-02-15 12:43:59.999 0.000011 0.000011 ... 0.000011 19430.0
2019-02-15 12:44:59.999 0.000011 0.000011 ... 0.000011 15319.0
2019-02-15 12:45:59.999 0.000011 0.000011 ... 0.000011 773414.0
2019-02-15 12:46:59.999 0.000011 0.000011 ... 0.000011 7449.0
2019-02-15 12:47:59.999 0.000011 0.000011 ... 0.000011 0.0
So all I want to do is stick the last two rows on the bottom of total_data, and I did:
df = total_data.append(prices_1m).drop_duplicates()
Resulting in the following:
ADABTC_o ADABTC_h ... ADABTC_c ADABTC_v
Date ...
2019-02-15 12:43:59.999 0.000011 0.000011 ... 0.000011 19430.0
2019-02-15 12:44:59.999 0.000011 0.000011 ... 0.000011 15319.0
2019-02-15 12:45:59.999 0.000011 0.000011 ... 0.000011 769414.0
2019-02-15 12:45:59.999000 0.000011 0.000011 ... 0.000011 773414.0
2019-02-15 12:46:59.999000 0.000011 0.000011 ... 0.000011 7449.0
So I think the problem is that the data is essentially 'live': 12:45:59.999 was the last data point in total_data, and I probably fetched it with 10 seconds still to go in that 60-second period. In prices_1m the 12:45:59.999 data point is fully updated, which explains the pair of rows with the same time but different 'v' values; because those rows aren't identical, drop_duplicates() doesn't treat them as duplicates. So I feel like we're almost there, but I want prices_1m to take priority over total_data, so that the most recent data is what gets appended. That is, I want this line:

2019-02-15 12:45:59.999 0.000011 0.000011 ... 0.000011 773414.0

to be the entry for 2019-02-15 12:45:59.999, and then to continue appending from there.
I get this when I do print(total_data.index):
Index(['2019-02-14 20:06:59.999', '2019-02-14 20:07:59.999',
'2019-02-14 20:08:59.999', '2019-02-14 20:09:59.999',
'2019-02-14 20:10:59.999', '2019-02-14 20:11:59.999',
'2019-02-14 20:12:59.999', '2019-02-14 20:13:59.999',
'2019-02-14 20:14:59.999', '2019-02-14 20:15:59.999',
...
'2019-02-15 12:36:59.999', '2019-02-15 12:37:59.999',
'2019-02-15 12:38:59.999', '2019-02-15 12:39:59.999',
'2019-02-15 12:40:59.999', '2019-02-15 12:41:59.999',
'2019-02-15 12:42:59.999', '2019-02-15 12:43:59.999',
'2019-02-15 12:44:59.999', '2019-02-15 12:45:59.999'],
dtype='object', name='Date', length=1000)
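Note the dtype='object' in that output: after the CSV round-trip the index holds plain strings, while prices_1m carries real timestamps, so the two indexes can never match up. A minimal check of the mismatch (assuming the commented-out read_csv lines above are active):

# The index read back from the CSV is made of strings ...
print(total_data.index.dtype)   # object
# ... while the freshly fetched frame has real timestamps
print(prices_1m.index.dtype)    # datetime64[ns]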
Upvotes: 2
Views: 2969
Reputation: 863531
I believe you need to work with a DatetimeIndex and concat, not merge on a Date column:
coins = ['ADABTC']
dfs = []
for coin in coins:
    get_data = get_bars(coin, '1m')
    df = get_data[['o', 'h', 'l', 'c', 'v']].add_prefix(coin + '_')
    df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
    dfs.append(df)

prices_1m = pd.concat(dfs, axis=1)
prices_1m.to_csv('1m_bin_db.csv')
And then:
# the CSV round-trip stores the index as strings, so parse it back to datetimes
total_data.index = pd.to_datetime(total_data.index)

# stack old and new data, then for each duplicated timestamp keep only the
# last row, so the freshly fetched (fully updated) candle wins
df = total_data.append(prices_1m)
df = df[~df.index.duplicated(keep='last')]
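Putting it together as a repeatable update step (a sketch; the file name follows the question, and index_col/parse_dates are the standard read_csv options for restoring the index on load):

import pandas as pd

# Load the stored data with the Date column parsed straight into a DatetimeIndex
total_data = pd.read_csv('1m_bin_db.csv', index_col='Date', parse_dates=True)

# prices_1m comes from the fetch loop above
df = total_data.append(prices_1m)

# For duplicated timestamps keep the last occurrence, i.e. the freshest fetch
df = df[~df.index.duplicated(keep='last')]
df.sort_index(inplace=True)

# Save back so the next run repeats the cycle
df.to_csv('1m_bin_db.csv')

Note that DataFrame.append was removed in pandas 2.0; on current versions pd.concat([total_data, prices_1m]) is the drop-in replacement.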
Upvotes: 4
Reputation: 75140
Use append with drop_duplicates() if Date is not an index:
old_df.append(new_df).drop_duplicates('Date')
If the data might change and you want the latest values to remain:

old_df.append(new_df).sort_values('Date', ascending=False).drop_duplicates('Date')
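A quick demonstration on the question's toy data (a sketch; the frames are trimmed to one stock column, and sorting the dd/mm/yy strings works here only because all dates share the same month and year):

import pandas as pd

old_df = pd.DataFrame({'Date': ['01/02/19', '02/02/19', '03/02/19'],
                       'Stock A': [100.0, 101.5, 102.0]})
new_df = pd.DataFrame({'Date': ['02/02/19', '03/02/19', '04/02/19'],
                       'Stock A': [101.5, 102.0, 103.0]})

# Newest-first sort means drop_duplicates' default keep='first'
# retains the most recent version of each date
latest = (old_df.append(new_df)
                .sort_values('Date', ascending=False)
                .drop_duplicates('Date')
                .sort_values('Date'))
print(latest)   # one row per date, 01/02/19 through 04/02/19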
Upvotes: 3