I have a Pandas dataframe and I am continually appending a row of data each second as below.
df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
sensor1 sensor2 sensor3
2015-04-14 08:50:23 5.4 5.6 5.7
2015-04-14 08:50:24 5.5 5.6 5.8
2015-04-14 08:50:26 5.2 5.3 5.4
If I continue this, I will eventually run into memory issues (each append operates on the whole DataFrame).
I only need to keep X rows of the data. i.e. after the operation, it will be:
>>>df
sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24 5.5 5.6 5.8
2015-04-14 08:50:26 5.2 5.3 5.4
2015-04-14 08:50:27 5.2 5.4 5.6
Is there a way I can specify a maximum number of rows, so that when any subsequent rows are added, the oldest row is deleted at the same time WITHOUT a "Check length of DataFrame, If length of DataFrame > X, Remove first row, Append new row"?
Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578
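For context, the linked answer relies on the maxlen argument of collections.deque, which gives exactly this drop-the-oldest behaviour; a minimal standard-library sketch:

```python
from collections import deque

# A deque with maxlen=3 silently discards the oldest entry on append.
d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(list(d))  # -> [2, 3, 4]
```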
Upvotes: 5
Views: 22317
This example initializes a DataFrame equal to the max size and fills it with Nones. It then iterates over a list of new rows, first shifting the original DataFrame and then appending the new row to the end. You didn't specify how you wanted to treat the index, so I ignored it.
import numpy as np
import pandas as pd

max_rows = 5
cols = list('AB')
# Initialize an empty DataFrame of max_rows rows
df = pd.DataFrame({c: np.repeat([None], max_rows) for c in cols})

new_rows = [pd.DataFrame({'A': [1], 'B': [10]}),
            pd.DataFrame({'A': [2], 'B': [11]}),
            pd.DataFrame({'A': [3], 'B': [12]}),
            pd.DataFrame({'A': [4], 'B': [13]}),
            pd.DataFrame({'A': [5], 'B': [14]}),
            pd.DataFrame({'A': [6], 'B': [15]}),
            pd.DataFrame({'A': [7], 'B': [16]})]

for row in new_rows:
    df = df.shift(-1)
    df.iloc[-1, :] = row.values
>>> df
A B
0 3 12
1 4 13
2 5 14
3 6 15
4 7 16
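The shift-and-overwrite step can also be wrapped in a small helper (the name push_row is hypothetical, not from the answer), which keeps the calling code tidy:

```python
import numpy as np
import pandas as pd

def push_row(df, values):
    """Shift the frame up one row and write `values` into the freed last row."""
    df = df.shift(-1)
    df.iloc[-1, :] = values
    return df

# A 3-row buffer: after four pushes, the first row has been pushed out.
df = pd.DataFrame({c: np.repeat([None], 3) for c in list('AB')})
for a, b in [(1, 10), (2, 11), (3, 12), (4, 13)]:
    df = push_row(df, [a, b])
print(df['A'].tolist())  # -> [2, 3, 4]
```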
Let's use a real example with one year of stock prices for AAPL.
from datetime import timedelta
# DataReader now lives in the pandas_datareader package
# (it was originally in pandas.io.data)
from pandas_datareader.data import DataReader

aapl = DataReader("AAPL", data_source="yahoo", start="2014-1-1", end="2015-1-1")
cols = aapl.columns
df = pd.DataFrame({c: np.repeat([None], max_rows) for c in aapl.columns})[cols]
# Initialize a datetime index ending the day before the first data point
df.index = pd.DatetimeIndex(end=aapl.index[0] + timedelta(days=-1), periods=max_rows, freq='D')
for timestamp, row in aapl.iterrows():
    df = df.shift(-1)
    df.iloc[-1, :] = row.values
    idx = df.index[:-1].tolist()
    idx.append(timestamp)
    df.index = idx
>>> df
Open High Low Close Volume Adj Close
2013-12-28 112.58 112.71 112.01 112.01 1.44796e+07 111.57
2013-12-29 112.1 114.52 112.01 113.99 3.3721e+07 113.54
2013-12-30 113.79 114.77 113.7 113.91 2.75989e+07 113.46
2013-12-31 113.64 113.92 112.11 112.52 2.98815e+07 112.08
2014-12-31 112.82 113.13 110.21 110.38 4.14034e+07 109.95
Upvotes: 1
pandas stores data in arrays. Doing the sort of operation you want inherently requires a copy for an array data structure. Since data is stored in contiguous (or strided) memory, adding something to the end and removing something from the beginning requires copying everything to a new region of memory. There is no way around this. You need to use a different data structure.
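To make the copying concrete (an illustrative sketch, not part of the answer), the same drop-front/append-back step on a raw NumPy array necessarily allocates new memory:

```python
import numpy as np

a = np.arange(5)
# "Drop the first element and append a new one" cannot be done in place:
# the result is a new array in a new memory region.
b = np.concatenate([a[1:], [99]])
print(b)                       # [ 1  2  3  4 99]
print(np.shares_memory(a, b))  # False: everything was copied
```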
Edit: Thinking about this a bit more, I see two approaches to do this.
The simplest and most straightforward would be to use a collections.deque of rows. You can just append a new row to the end, and if it gets too full it will drop the corresponding one from the beginning. At the end, you can just convert them into a DataFrame. I am only using the for loop as an example; I gather you get your data a different way, but it wouldn't matter:
import time
from collections import deque

import pandas as pd

maxlen = 1000
dq = deque(maxlen=maxlen)

for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3],
                        index=['sensor1', 'sensor2', 'sensor3'],
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

df = pd.concat(dq, axis=1).T
The second approach is to use a DataFrame of a fixed size, and use the modulo of the maximum length to choose the place to overwrite, while also keeping the item number in the DataFrame. Then you can sort by item number. In your case you could conceivably sort by time, but this approach is more general. As with the previous example, I will use a for loop to demonstrate, but you probably don't have one. Further, I will assume that you don't have a real iterable you can enumerate; if you do, then you don't have to keep track of the index number as I do here:
import time

import numpy as np
import pandas as pd

maxlen = 1000
df = pd.DataFrame(np.full((maxlen, 5), np.nan),
                  columns=['index', 'time',
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
for reading1, reading2, reading3 in readings:
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
    i += 1

df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)
Upvotes: 2
One way would be to pre-allocate the rows, and replace the values cyclically.
import numpy as np
import pandas as pd

# Say we want to limit to a thousand rows
N = 1000

# Create the DataFrame with N rows and 5 columns -- all NaNs
data = pd.DataFrame(np.full((N, 5), np.nan))

# To check the number of filled rows, we'll need to .dropna().
len(data.dropna())  # Returns 0

# Keep a running counter of the next index to insert into
counter = 0

# Insertion always happens at that counter...
data.loc[counter, :] = np.random.rand(5)

# ... and increment the counter; the modulo wraps it back to 0 at N
counter = (counter + 1) % N

# Now the DataFrame contains one row
len(data.dropna())  # Returns 1

# We can add several rows one after another. Let's add twice as many as N
for row in np.random.rand(2 * N, 5):
    data.loc[counter, :] = row
    counter = (counter + 1) % N

# Even after adding them, we still have only the last N rows
len(data)  # Returns N
This avoids the need to modify the data in any way, and would be a fast approach to appending the data. However, reading from the data can be slower if:

- You need the rows in insertion order: you'll have to re-order data using counter to extract the original order.
- You haven't yet inserted N rows: you'll need .dropna() (or a count of the total inserted rows) to remove the unused ones.

In most of the scenarios that I deal with where truncated-append performance matters, neither of the above is true, but your scenario may be different. In that case, @Alexander has a good solution involving .shift().
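The first caveat can also be handled without a stored index column by rotating the buffer with the counter position (a hedged sketch; the buffer size and values are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

N = 4
data = pd.DataFrame(np.full((N, 2), np.nan), columns=['a', 'b'])
counter = 0
for i in range(6):  # insert more rows than the buffer holds
    data.loc[counter % N, :] = [i, i * 10]
    counter += 1

# After wrapping, `counter % N` points at the oldest row, so rolling the
# positions by -(counter % N) restores insertion order.
ordered = data.iloc[np.roll(np.arange(N), -(counter % N))]
print(ordered['a'].tolist())  # -> [2.0, 3.0, 4.0, 5.0]
```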
Upvotes: 3