Reputation: 1769
I have a Data2 column in my DataFrame. I am trying to create a new column ('NewCol') by applying a filter to the Data2 column. The code below works and the values in the new column are correct, but I get the warning below when running it. How can I fix this? I suspect it affects performance.
C:\Python27\lib\site-packages\IPython\kernel\__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
# In[1]:
import pandas as pd
import numpy as np
from pandas import DataFrame
# In[2]:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
# In[4]:
df['NewCol'] = ''
df['NewCol'][df['Data2']> 60] = 'True'
df
Upvotes: 0
Views: 1268
Reputation: 4499
Try using .loc:
df.loc[df['Data2'] > 60, 'NewCol'] = 'True'
Pandas tries to be efficient with memory. For many indexing operations it returns a reference (a view) to data that already exists in the DataFrame. In some cases, however, it has to make a copy and return that instead. An assignment made on such a copy is not reflected in the original DataFrame, hence the warning.
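A minimal sketch of the difference, using a shortened version of the question's data. The names subset, unchanged and updated are only illustrative, and the explicit .copy() stands in for the case where pandas hands back a copy:

```python
import pandas as pd

# Shortened version of the question's frame (fewer rows, same idea).
df = pd.DataFrame({
    'Sym': ['aapl', 'aapl', 'aaww', 'aaww'],
    'Data2': [11, 15, 110, 40],
})
df['NewCol'] = ''

# An explicit copy of the filtered rows: edits made here never reach df.
subset = df[df['Data2'] > 60].copy()
subset['NewCol'] = 'True'
unchanged = df['NewCol'].tolist()

# One .loc call selects rows and column together and writes into df itself.
df.loc[df['Data2'] > 60, 'NewCol'] = 'True'
updated = df['NewCol'].tolist()

print(unchanged)
print(updated)
```

Only the .loc assignment changes df; the write to subset stays in the copy.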
Also, for all slicing, try to use .loc when slicing by index labels and .iloc when slicing by integer positions. In some cases this is faster, as explained in the documentation.
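To make the .loc / .iloc distinction concrete, here is a small sketch. The string index 'd', 'c', 'b', 'a' is made up for illustration so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
df = pd.DataFrame(
    {'Data2': [11, 8, 10, 15]},
    index=['d', 'c', 'b', 'a'],
)

# .loc looks rows and columns up by label.
by_label = df.loc['c', 'Data2']

# .iloc looks them up by integer position (second row, first column).
by_position = df.iloc[1, 0]

# Label slices include both endpoints; positional slices exclude the end.
label_slice = df.loc['d':'b', 'Data2'].tolist()
position_slice = df.iloc[0:2, 0].tolist()

print(by_label, by_position)
print(label_slice, position_slice)
```

Both lookups reach the same cell here, but the two slices differ in length because of the endpoint rule.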
On slicing with dfmi['one']['second'], the documentation says:
... dfmi['one'] selects the first level of the columns and returns a data frame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. This is indicated by the variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to __getitem__, so it has to treat them as linear operations; they happen one after another. Contrast this to df.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.
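The two access patterns from that passage can be sketched as follows. The column levels and values are invented for illustration; only the names dfmi and dfmi_with_one come from the documentation:

```python
import pandas as pd

# Hypothetical two-level columns: ('one', 'first'), ('one', 'second'), ...
columns = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)

# Chained indexing: two separate __getitem__ calls, one after another.
dfmi_with_one = dfmi['one']
chained_values = dfmi_with_one['second'].tolist()

# A single .loc call receives (slice(None), ('one', 'second')) in one go.
direct_values = dfmi.loc[:, ('one', 'second')].tolist()

# Assigning through the same single call writes into dfmi itself.
dfmi.loc[:, ('one', 'second')] = 0
after_assignment = dfmi[('one', 'second')].tolist()

print(chained_values, direct_values, after_assignment)
```

Reading gives the same values either way; the difference matters for assignment, where only the single .loc call is guaranteed to target the original frame.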
Upvotes: 1