Reputation: 111
I have read all the topics on `append` vs `concat` here, but I still don't get it: why should I use `append` if `concat` has the same options and functionality? Please correct me if I am mistaken.
`append()`: can append two data frames at once; it's the same as `concat` with `axis=0`.
`concat()`: can concatenate multiple DFs.
To concatenate multiple DFs, should I use `append()` with a for-loop? Is it faster?
Let's assume that I am opening DFs from different files like:
df = pd.DataFrame()
for file in file_folder:
    df = df.append(pd.read_csv(file))
OR
df = pd.DataFrame()
for file in file_folder:
    df = pd.concat([df, pd.read_csv(file)])
The output is the same. So why?
EDIT: to speed it up should I do:
df_list = []
for file in file_folder:
    df_list.append(pd.read_csv(file))
#and then use concat
df_all = pd.concat(df_list)
right?
Upvotes: 5
Views: 9456
Reputation: 559
Since the other answers are old, I would like to add that `DataFrame.append` is now deprecated in favour of `pd.concat` as of pandas 1.4. Thus, what follows is useful information for people running into this issue today.
To concatenate multiple data frames, you should not use append()
with a for-loop, because it can be computationally expensive and slow. You should create a list of data frames and then use concat()
once on the list. For example:
# Create an empty list of data frames
df_list = []
# Loop over the files and append each data frame to the list
for file in file_folder:
    df_list.append(pd.read_csv(file))
# Concatenate the list of data frames into one data frame
df_all = pd.concat(df_list)
This way, you avoid creating intermediate copies of the data frames and reduce the memory usage and execution time.
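As a minimal, self-contained sketch of why this matters (the frames below are invented stand-ins for the result of `pd.read_csv(file)`):

```python
import pandas as pd

# Stand-ins for the frames you would get from pd.read_csv(file);
# the data here is made up for the example.
frames = [pd.DataFrame({"a": [i, i + 1], "b": [i * 10, i * 10 + 1]})
          for i in range(100)]

# Slow pattern: growing a DataFrame inside the loop copies all
# previously accumulated rows on every iteration (quadratic work).
df_slow = pd.DataFrame()
for frame in frames:
    df_slow = pd.concat([df_slow, frame])

# Fast pattern: collect in a list, concatenate once.
df_fast = pd.concat(frames)
```

Both patterns produce the same result, but the second does a single pass over the data instead of re-copying the accumulated frame on every iteration.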
| Example | Using `concat()` | Using `append()` |
|---|---|---|
| Append two data frames row-wise | `df = pd.concat([df1, df2], axis=0)` | `df = df1.append(df2)` |
| Append two data frames column-wise | `df = pd.concat([df1, df2], axis=1)` | `df = df1.join(df2)` |
| Concatenate multiple data frames row-wise with different columns | `df = pd.concat([df1, df2, df3], axis=0, join='outer')` | `df = pd.DataFrame()`, then in a loop: `df = df.append(df_i)` for each frame |
| Concatenate multiple data frames row-wise with common columns | `df = pd.concat([df1, df2, df3], axis=0, join='inner')` | no direct equivalent: `append()` has no `join` parameter |
| Concatenate multiple data frames with a hierarchical index | `df = pd.concat([df1, df2, df3], axis=0, keys=['one', 'two', 'three'])` | no direct equivalent: you would still have to call `pd.concat` with `keys` |
As the table shows, `concat()` has a more concise syntax and is faster, so there is no reason to use `append()`.
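For instance, the hierarchical-index row of the table can be tried out with a few toy frames (the data is invented for the example):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})
df3 = pd.DataFrame({"x": [5, 6]})

# keys= adds an outer index level labelling which frame each row came from
df = pd.concat([df1, df2, df3], axis=0, keys=["one", "two", "three"])

# Select just the rows that came from df2
print(df.loc["two"])
```

The result has a two-level index, so you can still slice out the rows contributed by any one of the original frames.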
Even if you like the deprecated `append`, remember that `pd.concat` is more efficient and faster and has the same options and functionality. As Matus Dubrava mentioned, `append` calls `concat` under the hood anyway.
Upvotes: 8
Reputation: 14492
`append` is a convenience method that calls `concat` under the hood. You can see that in the implementation of the `append` method:
def append(...
    ...
    if isinstance(other, (list, tuple)):
        to_concat = [self, *other]
    else:
        to_concat = [self, other]
    return concat(
        to_concat,
        ignore_index=ignore_index,
        verify_integrity=verify_integrity,
        sort=sort,
    )
As for performance: both of these, called over and over in a loop, can be computationally expensive. You should just build a list and do one concatenation after you are done looping.
From the docs:
iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
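In practice, the pattern the docs recommend looks something like this (the row values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Collect new rows in a plain list instead of appending to df one by one
rows = []
for i in range(3):
    rows.append({"a": i, "b": i * 2})

# One concat at the end instead of one copy per appended row
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df)
```

`ignore_index=True` renumbers the combined frame so the original index values of each piece don't collide.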
Upvotes: 5
Reputation: 1113
It is an arbitrary choice to either use .append()
or .concat()
For example if you have to merge two dataframes on axis=1
then go with concat
and on the other hand if you have to merge two dataframes on axis=0
go with whatever choice you prefer to use or works best for you!
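As a quick, toy illustration of the two axes (column names and values are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

# axis=0 stacks rows; the columns are unioned and gaps become NaN
rows = pd.concat([df1, df2], axis=0)   # 4 rows, columns a and b

# axis=1 places the frames side by side, aligned on the index
cols = pd.concat([df1, df2], axis=1)   # 2 rows, columns a and b
```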
Personally, I would recommend going with the pandas default DF merger (`concat`) to avoid any delays, latencies and, most importantly, bugs in your code.
Hope you find this information useful and that it answers your question!
Happy Coding!
Upvotes: -1