Reputation: 16661
I have the following indexed DataFrame with named columns and rows not- continuous numbers:
a b c d
2 0.671399 0.101208 -0.181532 0.241273
3 0.446172 -0.243316 0.051767 1.577318
5 0.614758 0.075793 -0.451460 -0.012493
I would like to add a new column, 'e'
, to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
How can I add column e
to the above example?
Upvotes: 1336
Views: 2787691
Reputation: 23011
If the indices match, simple assignment does the job.
df['e'] = new_series
df = df.assign(e=new_series)
df[['e', 'f']] = new_dataframe
df = df.assign(**new_dataframe) # can assign multiple columns by unpacking
If the above throws a SettingWithCopyWarning
, enable Copy-on-Write before the assignment:
pd.set_option('mode.copy_on_write', True)
df['e'] = new_series
If the indices don't match (as in the OP), then relabeling the index of the new column(s) to be the same as the original dataframe using set_axis()
and assigning does the job. You can also use .assign()
, concat()
(as well as join()
for multiple columns).
for a single column:
df['e'] = new_column.set_axis(df.index)
# ^^^^^^^^ <--- relabel index
df = df.assign(e=new_column.set_axis(df.index))
for multiple columns (2 columns in the below example):
df[['e', 'f']] = new_columns.set_axis(df.index)
df = df.assign(**new_columns.set_axis(df.index).set_axis(['e', 'f'], axis=1))
# ^^^^ relabel index ^^^^^^ relabel columns
df = pd.concat((df, new_columns.set_axis(df.index)), axis=1)
df = df.join(new_columns.set_axis(df.index))
This method is particularly useful when assigning multiple columns where the dtypes are mixed, in which case converting to numpy ndarray using values
or to_numpy()
mangles the dtypes which you probably want to avoid.
df1 = pd.DataFrame({'a': range(3), 'b': [*'abc']}, index=[2,3,5])
df2 = pd.DataFrame({'c': [10, 20, 30], 'd':[0.5, 1.5, 2.5]}) # column 'c' dtype is int
df1[['c', 'd']] = df2.values # now column 'c' dtype is float (because dtype of 'd' is float)
df1 = df1.join(df2.set_axis(df1.index)) # dtypes are preserved
df1[['c', 'd']] = df2.set_axis(df1.index) # dtypes are preserved
Upvotes: 2
Reputation: 173
There are 4 ways you can insert a new column to a pandas DataFrame:
Let's consider the following example:
import pandas as pd
df = pd.DataFrame({
'col_a':[True, False, False],
'col_b': [1, 2, 3],
})
print(df)
col_a col_b
0 True 1
1 False 2
2 False 3
Using simple assignment
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0 a
1 b
2 c
dtype: object
df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
col_a col_b col_c
0 True 1 NaN
1 False 2 a
2 False 3 b
Using assign()
e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(colC=s.values, colB=e.values)
col_a col_b col_c
0 True 1.0 a
1 False 3.0 b
2 False 2.0 c
Using insert()
df.insert(len(df.columns), 'col_c', ser.values)
print(df)
col_a col_b col_c
0 True 1 a
1 False 2 b
2 False 3 c
Using concat()
ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('colC')], axis=1)
print(df)
col_a col_b col_c
0 True 1.0 NaN
1 False 2.0 NaN
2 False 3.0 NaN
10 NaN NaN a
20 NaN NaN b
30 NaN NaN c
Upvotes: 7
Reputation: 91
import pandas as pd
# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
'b': [0,0,0.101208,-0.243316,0,0.075793],
'c': [0,0,-0.181532,0.051767,0,-0.451460],
'd': [0,0,0.241273,1.577318,0,-0.012493]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]
df['e'] = col_e
# add column 'e'
df['e'] = col_e
# Observe the result
df
Upvotes: 2
Reputation: 95
Simple way to add new columns to the existing dataframe is:
new_cols = ['a' , 'b' , 'c' , 'd']
for col in new_cols:
df[f'{col}'] = 0 #assiging 0 for the placeholder
print(df.columns)
Upvotes: 0
Reputation: 7376
you can insert new column by for loop like this:
for label,row in your_dframe.iterrows():
your_dframe.loc[label,"new_column_length"]=len(row["any_of_column_in_your_dframe"])
sample code here :
import pandas as pd
data = {
"any_of_column_in_your_dframe" : ["ersingulbahar","yagiz","TS"],
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
your_dframe = pd.DataFrame(data)
for label,row in your_dframe.iterrows():
your_dframe.loc[label,"new_column_length"]=len(row["any_of_column_in_your_dframe"])
print(your_dframe)
and output is here:
any_of_column_in_your_dframe | calories | duration | new_column_length |
---|---|---|---|
ersingulbahar | 420 | 50 | 13.0 |
yagiz | 380 | 40 | 5.0 |
TS | 390 | 45 | 2.0 |
Not: you can use like this as well:
your_dframe["new_column_length"]=your_dframe["any_of_column_in_your_dframe"].apply(len)
Upvotes: 0
Reputation: 85605
Edit 2017
As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign
:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
Edit 2015
Some reported getting the SettingWithCopyWarning
with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.
>>> sLength = len(df1['a'])
>>> df1
a b c d
6 -0.269221 -0.026476 0.997517 1.294385
8 0.917438 0.847941 0.034235 -0.448948
>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e
6 -0.269221 -0.026476 0.997517 1.294385 1.757167
8 0.917438 0.847941 0.034235 -0.448948 2.228131
>>> pd.version.short_version
'0.16.1'
The SettingWithCopyWarning
aims to inform of a possibly invalid assignment on a copy of the Dataframe. It doesn't necessarily say you did it wrong (it can trigger false positives) but from 0.13.0 it let you know there are more adequate methods for the same purpose. Then, if you get the warning, just follow its advise: Try using .loc[row_index,col_indexer] = value instead
>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e f
6 -0.269221 -0.026476 0.997517 1.294385 1.757167 -0.050927
8 0.917438 0.847941 0.034235 -0.448948 2.228131 0.006109
>>>
In fact, this is currently the more efficient method as described in pandas docs
Original answer:
Use the original df1 indexes to create the series:
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
Upvotes: 1322
Reputation: 93
Whenever you add a Series object as new column to an existing DF, you need to make sure that they both have the same index. Then add it to the DF
e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f
Upvotes: 2
Reputation: 6769
If we want to assign a scaler value eg: 10 to all rows of a new column in a df:
df = df.assign(new_col=lambda x:10) # x is each row passed in to the lambda func
df will now have new column 'new_col' with value=10 in all rows.
Upvotes: 4
Reputation: 823
If you just need to create a new empty column then the shortest solution is:
df.loc[:, 'e'] = pd.Series()
Upvotes: 5
Reputation: 101
x=pd.DataFrame([1,2,3,4,5])
y=pd.DataFrame([5,4,3,2,1])
z=pd.concat([x,y],axis=1)
Upvotes: 3
Reputation: 639
this is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on an existing column data of the dataframe.
so, let our dataFrame has columns 'feature_1', 'feature_2', 'probability_score' and we have to add a new_column 'predicted_class' based on data in column 'probability_score'.
I will use map() function from python and also define a function of my own which will implement the logic on how to give a particular class_label to every row in my dataFrame.
data = pd.read_csv('data.csv')
def myFunction(x):
//implement your logic here
if so and so:
return a
return b
variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)
data['predicted_class'] = predicted_class
// check dataFrame, new column is included based on an existing column data for each row
data.head()
Upvotes: 2
Reputation: 618
Easiest ways:-
data['new_col'] = list_of_values
data.loc[ : , 'new_col'] = list_of_values
This way you avoid what is called chained indexing when setting new values in a pandas object. Click here to read further.
Upvotes: 45
Reputation: 109510
I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always got the same length as a dataframe.)
I assume that the index values in e
match those in df1
.
The easiest way to initiate a new column named e
, and assign it the values from your series e
:
df['e'] = e.values
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you can also use assign
, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.
df1 = df1.assign(e=e.values)
As per this example (which also includes the source code of the assign
function), you can also include more than one column:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
a b mean_a mean_b
0 1 3 1.5 3.5
1 2 4 1.5 3.5
In context with your example:
np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x <-0.7)
df1 = df1[-mask.any(axis=1)]
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
7 1.532779 1.469359 0.154947 0.378163
9 1.230291 1.202380 -0.387327 -0.302303
>>> e
0 -1.048553
1 -1.420018
2 -1.706270
3 1.950775
4 -0.509652
dtype: float64
df1 = df1.assign(e=e.values)
>>> df1
a b c d e
0 1.764052 0.400157 0.978738 2.240893 -1.048553
2 -0.103219 0.410599 0.144044 1.454274 -1.420018
3 0.761038 0.121675 0.443863 0.333674 -1.706270
7 1.532779 1.469359 0.154947 0.378163 1.950775
9 1.230291 1.202380 -0.387327 -0.302303 -0.509652
The description of this new feature when it was first introduced can be found here.
Upvotes: 231
Reputation: 316
to insert a new column at a given location (0 <= loc <= amount of columns) in a data frame, just use Dataframe.insert:
DataFrame.insert(loc, column, value)
Therefore, if you want to add the column e at the end of a data frame called df, you can use:
e = [-0.335485, -1.166658, -0.385571]
DataFrame.insert(loc=len(df.columns), column='e', value=e)
value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html
Upvotes: 11
Reputation: 2208
list_of_e
that has relevant data. df['e'] = list_of_e
Upvotes: 18
Reputation: 22238
It seems that in recent Pandas versions the way to go is to use df.assign:
df1 = df1.assign(e=np.random.randn(sLength))
It doesn't produce SettingWithCopyWarning
.
Upvotes: 60
Reputation: 461
If you want to set the whole new column to an initial base value (e.g. None
), you can do this: df1['e'] = None
This actually would assign "object" type to the cell. So later you're free to put complex data types, like list, into individual cells.
Upvotes: 27
Reputation: 401
If the column you are trying to add is a series variable then just :
df["new_columns_name"]=series_variable_name #this will do it for you
This works well even if you are replacing an existing column.just type the new_columns_name same as the column you want to replace.It will just overwrite the existing column data with the new series data.
Upvotes: 15
Reputation: 239
Foolproof:
df.loc[:, 'NewCol'] = 'New_Val'
Example:
df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
3 -0.147354 0.778707 0.479145 2.284143
4 -0.529529 0.000571 0.913779 1.395894
5 2.592400 0.637253 1.441096 -0.631468
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
8 0.606985 -2.232903 -1.358107 -2.855494
9 -0.692013 0.671866 1.179466 -1.180351
10 -1.093707 -0.530600 0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
18 0.693458 0.144327 0.329500 -0.655045
19 0.104425 0.037412 0.450598 -0.923387
df.drop([3, 5, 8, 10, 18], inplace=True)
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
4 -0.529529 0.000571 0.913779 1.395894
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
9 -0.692013 0.671866 1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
19 0.104425 0.037412 0.450598 -0.923387
df.loc[:, 'NewCol'] = 0
df
A B C D NewCol
0 -0.761269 0.477348 1.170614 0.752714 0
1 1.217250 -0.930860 -0.769324 -0.408642 0
2 -0.619679 -1.227659 -0.259135 1.700294 0
4 -0.529529 0.000571 0.913779 1.395894 0
6 0.757178 0.240012 -0.553820 1.177202 0
7 -0.986128 -1.313843 0.788589 -0.707836 0
9 -0.692013 0.671866 1.179466 -1.180351 0
11 -0.143273 -0.503199 -1.328728 0.610552 0
12 -0.923110 -1.365890 -1.366202 -1.185999 0
13 -2.026832 0.273593 -0.440426 -0.627423 0
14 -0.054503 -0.788866 -0.228088 -0.404783 0
15 0.955298 -1.430019 1.434071 -0.088215 0
16 -0.227946 0.047462 0.373573 -0.111675 0
17 1.627912 0.043611 1.743403 -0.012714 0
19 0.104425 0.037412 0.450598 -0.923387 0
Upvotes: 13
Reputation: 214927
If the data frame and Series object have the same index, pandas.concat
also works here:
import pandas as pd
df
# a b c d
#0 0.671399 0.101208 -0.181532 0.241273
#1 0.446172 -0.243316 0.051767 1.577318
#2 0.614758 0.075793 -0.451460 -0.012493
e = pd.Series([-0.335485, -1.166658, -0.385571])
e
#0 -0.335485
#1 -1.166658
#2 -0.385571
#dtype: float64
# here we need to give the series object a name which converts to the new column name
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df
# a b c d e
#0 0.671399 0.101208 -0.181532 0.241273 -0.335485
#1 0.446172 -0.243316 0.051767 1.577318 -1.166658
#2 0.614758 0.075793 -0.451460 -0.012493 -0.385571
In case they don't have the same index:
e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
Upvotes: 13
Reputation: 32194
A pandas dataframe is implemented as an ordered dict of columns.
This means that the __getitem__
[]
can not only be used to get a certain column, but __setitem__
[] =
can be used to assign a new column.
For example, this dataframe can have a column added to it by simply using the []
accessor
size name color
0 big rose red
1 small violet blue
2 small tulip red
3 small harebell blue
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note that this works even if the index of the dataframe is off.
df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
However, if you have a pd.Series
and try to assign it to a dataframe where the indexes are off, you will run in to trouble. See example:
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
This is because a pd.Series
by default has an index enumerated from 0 to n. And the pandas [] =
method tries to be "smart"
When you use the [] =
method pandas is quietly performing an outer join or outer merge using the index of the left hand dataframe and the index of the right hand series. df['column'] = series
This quickly causes cognitive dissonance, since the []=
method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advice against the []=
in code bases, but when exploring data in a notebook, it is fine.
If you have a pd.Series
and want it assigned from top to bottom, or if you are coding productive code and you are not sure of the index order, it is worth it to safeguard for this kind of issue.
You could downcast the pd.Series
to a np.ndarray
or a list
, this will do the trick.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
or
df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
But this is not very explicit.
Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
Setting the index of the pd.Series
to be the index of the df
is explicit.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
Or more realistically, you probably have a pd.Series
already available.
protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index
3 no
2 no
1 no
0 yes
Can now be assigned
df['protected'] = protected_series
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
df.reset_index()
Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index, this should be faster, but it is not very clean, since your function now probably does two things.
df.reset_index(drop=True)
protected_series.reset_index(drop=True)
df['protected'] = protected_series
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
df.assign
While df.assign
make it more explicit what you are doing, it actually has all the same problems as the above []=
df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
Just watch out with df.assign
that your column is not called self
. It will cause errors. This makes df.assign
smelly, since there are these kind of artifacts in the function.
df.assign(self=pd.Series(['no', 'no', 'no', 'yes'])
TypeError: assign() got multiple values for keyword argument 'self'
You may say, "Well, I'll just not use self
then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
Upvotes: 80
Reputation: 210812
For the sake of completeness - yet another solution using DataFrame.eval() method:
Data:
In [44]: e
Out[44]:
0 1.225506
1 -1.033944
2 -0.498953
3 -0.373332
4 0.615030
5 -0.622436
dtype: float64
In [45]: df1
Out[45]:
a b c d
0 -0.634222 -0.103264 0.745069 0.801288
4 0.782387 -0.090279 0.757662 -0.602408
5 -0.117456 2.124496 1.057301 0.765466
7 0.767532 0.104304 -0.586850 1.051297
8 -0.103272 0.958334 1.163092 1.182315
9 -0.616254 0.296678 -0.112027 0.679112
Solution:
In [46]: df1.eval("e = @e.values", inplace=True)
In [47]: df1
Out[47]:
a b c d e
0 -0.634222 -0.103264 0.745069 0.801288 1.225506
4 0.782387 -0.090279 0.757662 -0.602408 -1.033944
5 -0.117456 2.124496 1.057301 0.765466 -0.498953
7 0.767532 0.104304 -0.586850 1.051297 -0.373332
8 -0.103272 0.958334 1.163092 1.182315 0.615030
9 -0.616254 0.296678 -0.112027 0.679112 -0.622436
Upvotes: 5
Reputation: 18948
I was looking for a general way of adding a column of numpy.nan
s to a dataframe without getting the dumb SettingWithCopyWarning
.
From the following:
numpy
array of NaNs in-lineI came up with this:
col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
Upvotes: 6
Reputation: 10960
This is the simple way of adding a new column: df['e'] = e
Upvotes: 340
Reputation: 1508
To add a new column, 'e', to the existing data frame
df1.loc[:,'e'] = Series(np.random.randn(sLength))
Upvotes: 6
Reputation: 1711
I got the dreaded SettingWithCopyWarning
, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:
df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength), index=df.index))
This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note. That this only works once and will give an error message if trying to overwrite and existing column.
Note As above and from 0.16.0 assign is the best solution. See documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign
Works well for data flow type where you don't overwrite your intermediate values.
Upvotes: 25
Reputation: 1488
If you get the SettingWithCopyWarning
, an easy fix is to copy the DataFrame you are trying to add a column to.
df = df.copy()
df['col_name'] = values
Upvotes: 3
Reputation: 331
Let me just add that, just like for hum3, .loc
didn't solve the SettingWithCopyWarning
and I had to resort to df.insert()
. In my case false positive was generated by "fake" chain indexing dict['a']['e']
, where 'e'
is the new column, and dict['a']
is a DataFrame coming from dictionary.
Also note that if you know what you are doing, you can switch of the warning using
pd.options.mode.chained_assignment = None
and than use one of the other solutions given here.
Upvotes: 8
Reputation: 77
The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.
df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))
newCol = [3,5,7]
newName = 'C'
values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)
df = pd.DataFrame(values,columns=header)
Upvotes: 4