top bantz

Reputation: 615

Appending a pandas dataframe with new data

I'm trying to build a DataFrame of stock data. I can get all the data I need, but only 1000 data points at a time. So my plan is to save the initial 1000 data points in a CSV file, then run my program again every now and again and append any new data to the old DataFrame. It therefore needs to check the 'new df' against the 'old df' and append only the rows that aren't already there.

Let's say my 'old df', saved in a CSV file, looks like this:

Date     Stock A  Stock B  Stock C  Stock D
01/02/19 100.0    87.0     74.0     228.0
02/02/19 101.5    87.5     75.0     227.0
03/02/19 102.0    89.0     76.5     225.5

And I then run my program the next day, and the 'new df' looks like this:

Date     Stock A  Stock B  Stock C  Stock D
02/02/19 101.5    87.5     75.0     227.0
03/02/19 102.0    89.0     76.5     225.5
04/02/19 103.0    89.5     77.5     226.0

What I then need is for my program to identify that the last row in the 'new df' isn't in the 'old df', and to append that recent data to the 'old df', in this case:

04/02/19 103.0    89.5     77.5     226.0

Resulting in the following df, which would then be saved as the 'old df' so I can repeat the process the next day:

Date     Stock A  Stock B  Stock C  Stock D
01/02/19 100.0    87.0     74.0     228.0
02/02/19 101.5    87.5     75.0     227.0
03/02/19 102.0    89.0     76.5     225.5
04/02/19 103.0    89.5     77.5     226.0

I imagine the code will have to use some variation of:

old_df.append(new_df)

But with something in there to skip any data that is already in old_df.
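
For example, on the toy frames above, I imagine something along these lines (a rough sketch, assuming 'Date' is an ordinary column rather than the index):

import pandas as pd

# toy versions of the frames above
old_df = pd.DataFrame({'Date': ['01/02/19', '02/02/19', '03/02/19'],
                       'Stock A': [100.0, 101.5, 102.0]})
new_df = pd.DataFrame({'Date': ['02/02/19', '03/02/19', '04/02/19'],
                       'Stock A': [101.5, 102.0, 103.0]})

# keep only the rows of new_df whose Date is not already in old_df
fresh = new_df[~new_df['Date'].isin(old_df['Date'])]
combined = old_df.append(fresh, ignore_index=True)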

Any help would be much appreciated.

This is my code so far:

import requests
import json
import pandas as pd
import datetime as dt

#total_data = pd.read_csv('1m_bin_db.csv')
#total_data.set_index('Date', inplace=True)

def get_bars(symbol, interval):
    # request up to 1000 candles for this symbol/interval from the Binance klines endpoint
    url = 'https://api.binance.com/api/v1/klines?symbol=' + symbol + '&interval=' + interval + '&limit=1000'
    data = json.loads(requests.get(url).text)
    df = pd.DataFrame(data)
    df.columns = ['open_time',
                  'o', 'h', 'l', 'c', 'v',
                  'close_time', 'qav', 'num_trades',
                  'taker_base_vol', 'taker_quote_vol', 'ignore']
    # index by the candle close time (Binance returns epoch milliseconds)
    df.index = [dt.datetime.fromtimestamp(x / 1000.0) for x in df.close_time]
    return df

coins = ['ADABTC']

dfs = []
for coin in coins:
    get_data = get_bars(coin, '1m')
    df = get_data[['o', 'h', 'l', 'c', 'v']].add_prefix(coin + '_')
    df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
    dfs.append(df)

prices_1m = pd.concat(dfs, axis=1)
prices_1m.index.name = 'Date'

When I print total_data loaded from the CSV I get:

                         ADABTC_o  ADABTC_h    ...     ADABTC_c  ADABTC_v
Date                                           ...                       
2019-02-15 12:41:59.999  0.000011  0.000011    ...     0.000011   48805.0
2019-02-15 12:42:59.999  0.000011  0.000011    ...     0.000011     837.0
2019-02-15 12:43:59.999  0.000011  0.000011    ...     0.000011   19430.0
2019-02-15 12:44:59.999  0.000011  0.000011    ...     0.000011   15319.0
2019-02-15 12:45:59.999  0.000011  0.000011    ...     0.000011  769414.0

When I print prices_1m I get:

                         ADABTC_o  ADABTC_h    ...     ADABTC_c  ADABTC_v
Date                                           ...                       
2019-02-15 12:43:59.999  0.000011  0.000011    ...     0.000011   19430.0
2019-02-15 12:44:59.999  0.000011  0.000011    ...     0.000011   15319.0
2019-02-15 12:45:59.999  0.000011  0.000011    ...     0.000011  773414.0
2019-02-15 12:46:59.999  0.000011  0.000011    ...     0.000011    7449.0
2019-02-15 12:47:59.999  0.000011  0.000011    ...     0.000011       0.0

So all I want to do is stick the last two rows onto the bottom of total_data, so I tried:

df = total_data.append(prices_1m).drop_duplicates()

Resulting in the following:

                            ADABTC_o  ADABTC_h    ...     ADABTC_c  ADABTC_v
Date                                              ...                       
2019-02-15 12:43:59.999     0.000011  0.000011    ...     0.000011   19430.0
2019-02-15 12:44:59.999     0.000011  0.000011    ...     0.000011   15319.0
2019-02-15 12:45:59.999     0.000011  0.000011    ...     0.000011  769414.0
2019-02-15 12:45:59.999000  0.000011  0.000011    ...     0.000011  773414.0
2019-02-15 12:46:59.999000  0.000011  0.000011    ...     0.000011    7449.0

So I think the problem is that the data is essentially 'live': 12:45:59.999 was the last data point in total_data, and I probably pulled it with 10 seconds still to go in that 60-second candle. In prices_1m the 12:45:59.999 data point is fully updated, which explains the duplicate timestamps with different 'v' values. So I feel like we're almost there, but I want prices_1m to take priority over total_data, so that the most recent data is what ends up appended to total_data.

2019-02-15 12:45:59.999  0.000011  0.000011    ...     0.000011  773414.0

So I want that line to be the entry for 2019-02-15 12:45:59.999, and then continue appending from there.

I get this when I do print(total_data.index):

Index(['2019-02-14 20:06:59.999', '2019-02-14 20:07:59.999',
       '2019-02-14 20:08:59.999', '2019-02-14 20:09:59.999',
       '2019-02-14 20:10:59.999', '2019-02-14 20:11:59.999',
       '2019-02-14 20:12:59.999', '2019-02-14 20:13:59.999',
       '2019-02-14 20:14:59.999', '2019-02-14 20:15:59.999',
       ...
       '2019-02-15 12:36:59.999', '2019-02-15 12:37:59.999',
       '2019-02-15 12:38:59.999', '2019-02-15 12:39:59.999',
       '2019-02-15 12:40:59.999', '2019-02-15 12:41:59.999',
       '2019-02-15 12:42:59.999', '2019-02-15 12:43:59.999',
       '2019-02-15 12:44:59.999', '2019-02-15 12:45:59.999'],
      dtype='object', name='Date', length=1000)
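
Which suggests the dates read back from the CSV are plain strings, while get_bars builds real datetimes. A quick check (a sketch) shows the mismatch:

# the CSV round trip leaves strings; get_bars produces a DatetimeIndex
print(type(total_data.index[0]))   # str
print(type(prices_1m.index[0]))    # pandas Timestamp
# which is why '12:45:59.999' and '12:45:59.999000' count as different labels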

Upvotes: 2

Views: 2969

Answers (2)

jezrael

Reputation: 863531

I believe you need to work with a DatetimeIndex and append/concat, not merge by a date column:

coins = ['ADABTC']

dfs = []
for coin in coins:
    get_data = get_bars(coin, '1m')
    df = get_data[['o', 'h', 'l', 'c', 'v']].add_prefix(coin + '_')
    df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
    dfs.append(df)

prices_1m = pd.concat(dfs, axis=1)
prices_1m.to_csv('1m_bin_db.csv')

And then:

total_data.index = pd.to_datetime(total_data.index)  # convert the CSV string dates to real datetimes

df = total_data.append(prices_1m)
df = df[~df.index.duplicated(keep='last')]  # on duplicate timestamps, keep the fresher prices_1m row
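
Put together, one possible daily round trip, as a sketch (the CSV name is taken from the commented-out lines in the question):

# load the stored data; the Date index comes back as plain strings
total_data = pd.read_csv('1m_bin_db.csv', index_col='Date')
total_data.index = pd.to_datetime(total_data.index)

# stack the fresh pull underneath and keep the newest row per timestamp
df = total_data.append(prices_1m)
df = df[~df.index.duplicated(keep='last')]

# overwrite the store so the next run repeats the process
df.to_csv('1m_bin_db.csv')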

Upvotes: 4

anky

Reputation: 75140

Use append with drop_duplicates() if 'Date' is not the index.

old_df.append(new_df).drop_duplicates('Date')

If the data might change and you want the latest values to remain, keep the last occurrence of each date (after the append, the rows from new_df sit below those of old_df):

old_df.append(new_df).drop_duplicates('Date', keep='last')
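
For instance, on toy frames with hypothetical values (a sketch, just to show which duplicate survives):

import pandas as pd

old_df = pd.DataFrame({'Date': ['01/02/19', '02/02/19'],
                       'Stock A': [100.0, 101.5]})
new_df = pd.DataFrame({'Date': ['02/02/19', '03/02/19'],
                       'Stock A': [101.7, 102.0]})  # 02/02/19 revised

merged = old_df.append(new_df).drop_duplicates('Date', keep='last')
print(merged)  # 02/02/19 keeps the newer 101.7 value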

Upvotes: 3
