fe ner

Reputation: 1769

Pandas Indexing vs Copy Error

I have a Data2 column in my dataframe. I am trying to create a new column ('NewCol') by applying a filter to the Data2 column. The code below works and the values in the new column are correct, but I get the warning below when running it. How can I fix this? I suspect it affects performance.

C:\Python27\lib\site-packages\IPython\kernel\__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

# In[1]:

import pandas as pd
import numpy as np
from pandas import DataFrame


# In[2]:

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})


# In[4]:

df['NewCol'] = ''
df['NewCol'][df['Data2']> 60] = 'True'
df

Upvotes: 0

Views: 1268

Answers (1)

shanmuga

Reputation: 4499

Try using .loc:

df.loc[df['Data2'] > 60, 'NewCol'] = 'True'

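Pandas tries to be efficient with memory. Where it can, an indexing operation returns a view onto data that already exists in the DataFrame, but in some cases it has to make a copy and return that instead. In df['NewCol'][df['Data2']> 60] = 'True' the first step, df['NewCol'], runs on its own, and the assignment is then applied to whatever that step returned. If that object is a copy, the assignment is not reflected in the original DataFrame. Hence the warning, which is about correctness rather than performance.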
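As a minimal sketch of the difference, the snippet below writes to an explicit copy and then to the original through .loc. It reuses the Data2 values from the question; the variable names subset, unchanged and updated are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Data2': [11, 8, 10, 15, 110, 60, 100, 40]})
df['NewCol'] = ''

# Filtering with a boolean mask produces a new object; .copy() makes
# that explicit, so writing to it is intentional and raises no warning.
subset = df[df['Data2'] > 60].copy()
subset['NewCol'] = 'True'

# The write above went to the copy only; the original is untouched.
unchanged = df['NewCol'].tolist()

# A single .loc call selects rows and column together and assigns
# directly into the original DataFrame.
df.loc[df['Data2'] > 60, 'NewCol'] = 'True'
updated = df['NewCol'].tolist()

print(unchanged)
print(updated)
```

If the filtered rows are wanted as an independent object, .copy() states that intent; if the goal is to modify the original, a single .loc assignment is the safer form.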

Also, for slicing in general, prefer .loc when selecting by index labels and .iloc when selecting by integer position. In some cases this is also faster, as explained in the documentation:

When slicing using dfmi['one']['second']:
... dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. This is indicated by the variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to __getitem__, so it has to treat them as linear operations: they happen one after another.

Contrast this to dfmi.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.
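The two access patterns described in the quote can be sketched as follows. The dfmi frame here is a made-up example with two column levels, not the one from the documentation, and the values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two-level column labels.
columns = pd.MultiIndex.from_tuples(
    [('one', 'first'), ('one', 'second'), ('two', 'first')]
)
dfmi = pd.DataFrame(np.arange(9).reshape(3, 3), columns=columns)

# Chained indexing: two separate __getitem__ calls, with an
# intermediate singly-indexed DataFrame in between.
dfmi_with_one = dfmi['one']
chained_values = dfmi_with_one['second'].tolist()

# Single call: one nested tuple handed to .loc in one step.
single_values = dfmi.loc[:, ('one', 'second')].tolist()

# Reading gives the same values either way; for assignment, only the
# single .loc call is certain to target the original frame.
dfmi.loc[:, ('one', 'second')] = 0
after_assignment = dfmi[('one', 'second')].tolist()

print(chained_values, single_values, after_assignment)
```

For reads the two forms agree; the difference matters when the expression appears on the left-hand side of an assignment.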

Upvotes: 1
