Wiktor Tomczak
Wiktor Tomczak

Reputation: 484

How to replace row values and index label efficiently in Pandas?

I'm looking for an efficient way to repeatedly update rows in a DataFrame. By this I mean changing row values and its index label. I specifically need help with the latter. The best I could find is How to change Pandas dataframe index value? However, this updates the entire index, while I care about a single index label.

Directly assigning to index[n] is not supported:

>>> df.index[1] = 'new_label'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1374, in __setitem__
    raise TypeError("Index does not support mutable operations")

Would it work to modify (assign to) the numpy array underlying the index?

>>> df.index._values[1] = 'new_label'

The updates don't break index sortedness.


More context:

I have a DataFrame indexed by timestamps (DatetimeIndex) where I need to efficiently append new rows in real time (many times per second). I preallocate a large fixed-size DataFrame with NaT/NaN s and I append rows by writing into the next empty row.

Upvotes: 2

Views: 1223

Answers (2)

jezrael
jezrael

Reputation: 862851

As Steven G pointed, this update is very unefficient.

Better is create list of DataFrames and then use concat:

df1 = pd.DataFrame({'a': [1,2,3]}, index=pd.date_range('2015-01-01', periods=3))
print (df1)
            a
2015-01-01  1
2015-01-02  2
2015-01-03  3

df2 = pd.DataFrame({'a': [7,8,9]}, index=pd.date_range('2016-01-01', periods=3))
print (df2)
            a
2016-01-01  7
2016-01-02  8
2016-01-03  9

dfs = [df1,df2]

df = pd.concat(dfs)
print (df)
            a
2015-01-01  1
2015-01-02  2
2015-01-03  3
2016-01-01  7
2016-01-02  8
2016-01-03  9

Then you can concat another DataFrame:

df3 = pd.DataFrame({'a': [3,2,5]}, index=pd.date_range('2017-01-01', periods=3))
print (df3)
            a
2017-01-01  3
2017-01-02  2
2017-01-03  5

df = pd.concat([df, df3])
print (df)
            a
2015-01-01  1
2015-01-02  2
2015-01-03  3
2016-01-01  7
2016-01-02  8
2016-01-03  9
2017-01-01  3
2017-01-02  2
2017-01-03  5

Or use DataFrame.append:

df4 = pd.DataFrame({'a': [3,2,4]}, index=pd.date_range('2018-01-01', periods=3))
print (df4)
            a
2018-01-01  3
2018-01-02  2
2018-01-03  4

df = df.append(df4)
print (df)
            a
2015-01-01  1
2015-01-02  2
2015-01-03  3
2016-01-01  7
2016-01-02  8
2016-01-03  9
2017-01-01  3
2017-01-02  2
2017-01-03  5
2018-01-01  3
2018-01-02  2
2018-01-03  4

Solution for replacing value in index:

You can use very similar as df.index._values[1] = 'new_label' only remove _ and another solution is with Index.set_value:

df = pd.DataFrame({'a': [1,2,3]}, index=pd.date_range('2016-01-01', periods=3))
print (df)
            a
2016-01-01  1
2016-01-02  2
2016-01-03  3

df.index.values[0] = pd.Timestamp(2016,11,23,1,0,0)
df.index.set_value(df.index, df.index[1], pd.Timestamp(2016,11,22,1,0,0))

print (df)
                     a
2016-11-23 01:00:00  1
2016-11-22 01:00:00  2
2016-01-03 00:00:00  3

Upvotes: 1

Steven G
Steven G

Reputation: 17132

updating a dataframe multiple time per second is not very efficient. you should append a list and then transform the list to a dataframe afterward.

such has:

log=list()
for i in range(1,10):   # here is your loop every minutes where log is updated
    log.append([date, value1, value2])

df = pd.DataFrame(log) # now you create the dataframe after being done updating the list.

Upvotes: 1

Related Questions