PhE

Reputation: 16644

Create a Pandas Dataframe by appending one row at a time

How do I create an empty DataFrame, then add rows, one by one?

I created an empty DataFrame:

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row at the end and fill a single field with:

df = df._set_value(index=len(df), col='qty1', value=10.0)

It works for only one field at a time. What is a better way to add a new row to df?

Upvotes: 1418

Views: 2421543

Answers (30)

qwr

Reputation: 10958

Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column, given we know our columns beforehand. This data structure is like how we would access a column as df["col"]. At the end we construct our DataFrame once.

In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists (of length n), versus one list of n dictionaries (with c entries). The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists which is simpler and more efficient than creating new dictionaries.

# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}

# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")

# At the end, construct our DataFrame
df = pd.DataFrame(data)
#   Animal  Color
# 0    cow   blue
# 1  horse    red
# 2  mouse  black
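
One way to guard against the "every column gets another value" pitfall is to append a whole row at once through a small helper. This is only a sketch: append_row is a hypothetical helper name (not part of pandas), and it assumes each new row arrives as a dict keyed by column name.

```python
import pandas as pd

def append_row(data, row):
    """Append one row (a dict keyed by column name) to a dict of lists.

    Hypothetical helper: it raises if the row's keys do not match the
    columns, so the lists can never end up with different lengths.
    """
    if set(row) != set(data):
        raise ValueError(f"row keys {sorted(row)} do not match columns {sorted(data)}")
    for column, value in row.items():
        data[column].append(value)

data = {"Animal": ["cow", "horse"], "Color": ["blue", "red"]}
append_row(data, {"Animal": "mouse", "Color": "black"})

df = pd.DataFrame(data)  # built once, at the end
```

The check costs a little per row, but it turns a silent length mismatch into an immediate error instead of a failure at DataFrame construction time.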

Upvotes: 17

ShikharDua

Reputation: 10009

In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:

  1. Create a list of dictionaries in which each dictionary corresponds to an input data row.
  2. Create a data frame from this list.

I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..) 

    rows_list.append(dict1)

df = pd.DataFrame(rows_list)               
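
To make the placeholder loop above concrete, here is a minimal runnable sketch. The contents of input_rows and the column names are made up for illustration; in practice each dictionary would be filled from whatever your real input looks like.

```python
import pandas as pd

# Hypothetical input: each raw row is a (lib, qty1, qty2) tuple.
input_rows = [("A", 1, 2.5), ("B", 3, 4.5), ("C", 5, 6.5)]
column_names = ["lib", "qty1", "qty2"]

rows_list = []
for row in input_rows:
    # Map column name -> value for this row.
    rows_list.append(dict(zip(column_names, row)))

# One DataFrame construction at the end instead of one per row.
df = pd.DataFrame(rows_list)
```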

Upvotes: 810

cs95

Reputation: 402814

From pandas >= 2.0, append has been removed!

DataFrame.append was deprecated in version 1.4 and removed from the pandas API entirely in version 2.0.

See the docs on Deprecations as well as this github issue that originally proposed its deprecation.

If you are running pandas version 2.0 or later, you will likely run into the following error:

AttributeError: 'DataFrame' object has no attribute 'append'

Keep reading if you would like to learn about more idiomatic alternatives to append.
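
For code that still calls df.append on a single row, a common drop-in replacement is pd.concat with a one-row DataFrame. A minimal sketch, with column names and values made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
new_row = {"A": 3, "B": "z"}

# pandas < 2.0 only:  df = df.append(new_row, ignore_index=True)
# Replacement that also works on pandas >= 2.0:
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
```

This is reasonable for a one-off row. For many rows, collect them first and build or concatenate once.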


NEVER grow a DataFrame!

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.

  1. It is always cheaper/faster to append to a list and create a DataFrame in one go.
  2. Lists take up less memory and are a much lighter data structure to work with, append, and remove.
  3. dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
  4. An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

Note that if some_function_that_yields_data() returns smaller DataFrames, you can accumulate individual DataFrames inside a list and then make a single call to pd.concat at the end.
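
As a rough sketch of that last case, the loop below collects small DataFrames in a list and concatenates them once. yield_chunks is a hypothetical stand-in for some_function_that_yields_data(), and its contents are made up.

```python
import pandas as pd

def yield_chunks():
    """Hypothetical stand-in for a function that yields small DataFrames."""
    for start in range(0, 6, 2):
        yield pd.DataFrame({"A": [start, start + 1],
                            "B": [start * 10, (start + 1) * 10]})

chunks = []
for chunk in yield_chunks():
    chunks.append(chunk)  # cheap: the list only stores a reference

# Single concatenation at the end; ignore_index=True renumbers the rows 0..n-1.
df = pd.concat(chunks, ignore_index=True)
```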

These options are horrible

  1. append or concat inside a loop

    append and concat aren't inherently bad in isolation. The problem starts when you call them iteratively inside a loop: every call copies the whole DataFrame built so far, which results in quadratic memory usage.

    # Creates empty DataFrame and appends
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for a, b, c in some_function_that_yields_data():
        df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
        # This is equally bad:
        # df = pd.concat(
        #       [df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])],
        #       ignore_index=True)
    
  2. Empty DataFrame of NaNs

    Never create a DataFrame of NaNs as the columns are initialized with object (slow, un-vectorizable dtype).

    # Creates DataFrame of NaNs and overwrites values.
    df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
    for i, (a, b, c) in enumerate(some_function_that_yields_data()):
        df.loc[i] = [a, b, c]
    

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.


Benchmarking code for reference.


It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks to get the right answer with the right code, not the right answer with the wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row: often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.

Upvotes: 353

mpa

Reputation: 88

Here are the 3 regularly mentioned options and their shortcomings for adding

  • a single row (not multiple rows)
  • optimized for readability (not for runtime performance, e.g. copying the DataFrame is acceptable even though it is not preferred)
  • columns can have different dtypes
  • keep the dtype of all columns
  • the index can have any form, e.g. 'holes' in an integer series
  • keep the dtype of the df.index

The code setup:

df = pd.DataFrame({'carId': [1, 4, 7], 'maxSpeed': [1.1, 4.4, 7.7]})
df = df.astype({
    'carId': np.uint16,
    'maxSpeed': np.float32,
})
df.set_index('carId', drop=False, inplace=True)
assert df.index.dtype == np.uint64

# the row to add
additional_row = [9, 9.9]
assert len(df.columns) == len(additional_row)
original_dtypes = df.dtypes
original_index_dtype = df.index.dtype

1) pd.concat()

df_new_row = pd.DataFrame([additional_row], columns=df.columns)
newDf = pd.concat([df, df_new_row])
assert df.dtypes.equals(newDf.dtypes)  # fails: carId is np.int64 and maxSpeed is np.float64
assert newDf.dtypes.equals(original_dtypes)  # fails: newDf.index.dtype is np.float64

2) df.loc[]

df.loc[additional_row[0], :] = additional_row
assert df.index.dtype == original_index_dtype
assert df.dtypes.equals(original_dtypes)  # fails: carId and maxSpeed are np.float64

3) df.append()

deprecated since pandas 1.4.0 and removed in pandas 2.0

solution

df.loc[] leaves the df.index intact, so I typically convert the types of the columns:

df.loc[additional_row[0], :] = additional_row
df = df.astype(original_dtypes)
assert df.index.dtype == original_index_dtype
assert df.dtypes.equals(original_dtypes)

Note that df.astype() creates a copy of the df. df.astype(copy=False) avoids this if you can accept the side effects of the copy parameter.

If you do not want to set the index explicitly, use e.g. df.loc[df.index.max() + 1, :] = additional_row. Note that df.index.max() fails if df is empty.
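
To sidestep the empty-DataFrame case, one option is to compute the next label in a small helper. This is a sketch under assumptions: next_label is a hypothetical name, and it only makes sense for an integer-labelled index.

```python
import pandas as pd

def next_label(df, start=0):
    """Hypothetical helper: next integer label for an integer-labelled index."""
    if len(df.index) == 0:
        return start  # nothing to take the max of
    return df.index.max() + 1

# Works on an empty frame ...
empty = pd.DataFrame(columns=["carId", "maxSpeed"])
empty.loc[next_label(empty)] = [9, 9.9]

# ... and on an index with 'holes' (labels 1, 4, 7).
holes = pd.DataFrame({"carId": [1, 4, 7], "maxSpeed": [1.1, 4.4, 7.7]},
                     index=[1, 4, 7])
holes.loc[next_label(holes)] = [9, 9.9]
```

The dtype caveats described above still apply after the assignment, so the astype step may still be needed.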

Unfortunately, How to add an extra row to a pandas dataframe has been marked as duplicate and points to this question. The title of this post "by appending one row at a time" implies that regularly adding multiple lines to a DataFrame is a good idea. I agree with many previous comments that there are probably not many use cases for this. However, adding a single row to a DataFrame occurs more often, even though it's still an edge case.

Upvotes: 0

NPE

Reputation: 500683

You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.

For example:

def append_row(df, row):
    return pd.concat([
                df, 
                pd.DataFrame([row], columns=row.index)]
           ).reset_index(drop=True)

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})

df = append_row(df, new_row)

Upvotes: 358

Joaquim

Reputation: 440

This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.

import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
#         print(x, y)
        dict1 = dict([('x',x),('y',y)])
#         print(f'dict 1 {dict1}')
        feat_list.append(dict1)
#         print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

Upvotes: -1

Prajot Kuvalekar

Reputation: 6658

If you always want to add a new row at the end, use this:

df.loc[len(df)] = ['name5', 9, 0]

Upvotes: 24

Gerard

Reputation: 177

If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end. It seems to be even faster than converting a list of dicts.

import time
import pandas as pd
import numpy as np
from string import ascii_uppercase

startTime = time.perf_counter()
numcols, numrows = 5, 10000
# Preallocate the full array, then write into it one row at a time
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
# Convert to a DataFrame once, at the end
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
print(df5.shape)

Upvotes: 0

Harshal Deore

Reputation: 1228

initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}

df = pd.DataFrame(initial_data)

df

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4

val_1 = [10]
val_2 = [14]
val_3 = [20]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4
0   10    14    20

You can use a for loop to iterate through values or can add arrays of values.

val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4
0   10    14    20
1   11    15    21
2   12    16    22
3   13    17    43

Upvotes: 3

hansrajswapnil

Reputation: 639

You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).

So, I put the data for the new row in a dict() and its index in a list.

new_dict = {put input for new row here}
new_list = [put your index here]

new_df = pd.DataFrame(data=new_dict, index=new_list)

df = pd.concat([existing_df, new_df])
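
With the placeholders filled in, this might look like the sketch below. The column names, values and index labels are made up for illustration. Wrapping each value in a one-element list is my choice here, so that the dict describes exactly one row.

```python
import pandas as pd

existing_df = pd.DataFrame({"qty1": [1, 2], "qty2": [10, 20]}, index=["a", "b"])

new_dict = {"qty1": [3], "qty2": [30]}  # one value per column, wrapped in a list
new_list = ["c"]                        # index label for the new row

new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
```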

Upvotes: 3

srikanth Gattu

Reputation: 23

Before adding a row, convert the DataFrame to a dictionary. Its keys are the DataFrame's column names, and each value is itself a dictionary holding that column's values, keyed by the row's index label in the DataFrame.

That idea led me to write the code below.

df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns:   # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i]   # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
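
The snippet above stops at the dictionary. A small end-to-end sketch of the same idea, including the conversion back to a DataFrame, could look like this. The three-column frame and its values are made up to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({"id": ["s_100"], "city": ["delhi"], "score": [10]}, index=[100])

# Dict of dicts: {'id': {100: 's_100'}, 'city': {100: 'delhi'}, 'score': {100: 10}}
df2 = df.to_dict()

values = ["s_101", "hyderabad", 20]  # the new row, one value per column
for column, value in zip(df.columns, values):
    df2[column][101] = value         # 101 is the new row's index label

df = pd.DataFrame(df2)               # convert the dict of dicts back to a DataFrame
```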

Upvotes: 0

Shahir Ansari

Reputation: 1848

If you want to add a row at the end, append it as a list:

valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)

Upvotes: 4

kamran kausar

Reputation: 4603

pandas.DataFrame.append

DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'

Code

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)

With ignore_index set to True:

df.append(df2, ignore_index=True)

Upvotes: 0

RockStar

Reputation: 1314

You can use a generator object to create a DataFrame, which is more memory efficient than a list.

num = 10

# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))

# Generator expression to generate generator object (it can be consumed only once; it cannot be reused after the data is populated)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )

df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))

To add a row to an existing DataFrame you can use the append method.

df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400  }])

Upvotes: 13

Armali

Reputation: 19395

We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method that uses dicts and creates the DataFrame at the end. He found the latter to be the fastest by far.

But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dicts. So we should consider df.values[subscript] = … more often. Note, however, that .values takes a zero-based positional subscript, which may differ from the labels in DataFrame.index.

Upvotes: 0

shivampip

Reputation: 2144

Here is the way to add/append a row in a Pandas DataFrame:

def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()

add_row(df, [1,2,3])

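It can be used to insert a row at the top (label 0) of an empty or populated Pandas DataFrame with an integer index. Note that it returns the re-sorted DataFrame, so assign the result.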

Upvotes: 7

Qinsi

Reputation: 820

I figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Note the caveat with performance as noted in the comments.

Upvotes: 18

Mikhail_Sam

Reputation: 11238

In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.

Performance

  1. Using .append (NPE's answer)
  2. Using .loc (fred's answer)
  3. Using .loc with preallocating (FooBar's answer)
  4. Using dict and create DataFrame in the end (ShikharDua's answer)

Runtime results (in seconds):

Approach                 1000 rows   5000 rows   10 000 rows
.append                  0.69        3.39        6.78
.loc without prealloc    0.74        3.90        8.35
.loc with prealloc       0.24        2.58        8.70
dict                     0.012       0.046       0.084

So I use addition through the dictionary for myself.


Code:

import pandas as pd
import numpy as np
import time

# On a re-run in the same session, you can free the old results first:
# del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S.: I believe my implementation isn't perfect, and maybe there is some optimization that could be done.

Upvotes: 460

Vineet Jain

Reputation: 1575

Keep it simple: read a list as input and append it as a row of the DataFrame:

import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)

Upvotes: 1

hkyi

Reputation: 3874

For the sake of a Pythonic way:

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN

Upvotes: 48

Jack Daniel

Reputation: 2611

Create a new record (data frame) and add to old_data_frame.

Pass a list of values and the corresponding column names to create a new_record (data_frame):

new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])

old_data_frame = pd.concat([old_data_frame, new_record])

Upvotes: 9

user3250815

Reputation: 149

This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.

While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!

import pandas as pd

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)

Upvotes: 14

Nasser Al-Wohaibi

Reputation: 4661

For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.

Add rows through loc on a non-existent index key (the older ix indexer has since been removed from pandas). For example:

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

Or:

In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....:

In [2]: dfi
Out[2]:
   A  B
0  0  1
1  2  3
2  4  5

In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [4]: dfi
Out[4]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

In [5]: dfi.loc[3] = 5

In [6]: dfi
Out[6]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

Upvotes: 79

Mahdi

Reputation: 235

If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:

df.loc[len(df)] = new_list

If you want to add a new data frame new_df under data frame df, then you can use:

df.append(new_df)

Upvotes: 0

fred

Reputation: 10060

You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

Upvotes: 919

Giorgos Myrianthous

Reputation: 39860

All you need is loc[df.shape[0]] or loc[len(df)]


# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False] 

or

df.loc[len(df)] = ['col1Value', 100, 'col3Value', False] 
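
A concrete, runnable version of the same idea is sketched below. The four column names and the starting rows are made up for illustration.

```python
import pandas as pd

# Four columns holding (str, int, str, bool) values.
df = pd.DataFrame(
    {"name": ["a", "b"], "count": [1, 2], "tag": ["x", "y"], "flag": [True, False]}
)

# len(df) is 2 and labels 0 and 1 are taken, so label 2 is new: the row is appended.
df.loc[len(df)] = ["col1Value", 100, "col3Value", False]
```

This relies on the default 0..n-1 index. If the index has gaps (for example after dropping rows) or is not integer-based, len(df) may equal an existing label, in which case the assignment overwrites that row instead of appending.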

Upvotes: 4

Brian Burns

Reputation: 22042

You can also build up a list of lists and convert it to a dataframe:

import pandas as pd

columns = ['i','double','square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

giving

    i   double  square
0   0   0   0
1   1   2   1
2   2   4   4
3   3   6   9
4   4   8   16
5   5   10  25

Upvotes: 38

tomatom

Reputation: 469

This will take care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan for the first index:

import math
import pandas as pd

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]

Upvotes: -3

qed

Reputation: 23134

Another way to do it (probably not very performant):

# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))

You can also enhance the DataFrame class like this:

import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row

Upvotes: 3

W.P. McNeill

Reputation: 17056

You can append a single row as a dictionary using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black

Upvotes: 81
