Reputation: 111
I have read all the topics on `append` vs `concat` here, but I still don't get it: why should I use `append` if `concat` has the same options and functionality? Please correct me if I am mistaken.
`append()`: can append two data frames at once; it's the same as `concat` with `axis=0`.
`concat()`: can concatenate multiple DFs.
To concatenate multiple DFs, should I use `append()` with a for-loop? Is it faster?
Let's assume that I am opening DFs from different files like:
df = pd.DataFrame()
for file in file_folder:
    df = df.append(pd.read_csv(file))
OR
df = pd.DataFrame()
for file in file_folder:
    df = pd.concat([df, pd.read_csv(file)])
The output is the same. So why?
EDIT: to speed it up should I do:
df_list = []
for file in file_folder:
    df_list.append(pd.read_csv(file))
#and then use concat
df_all = pd.concat(df_list)
right?
Upvotes: 5
Views: 9456
Reputation: 559
Since the other answers are old, I would like to add that `DataFrame.append` is now deprecated in favour of `pd.concat` as of pandas 1.4. Thus, what follows is useful information for people running into this issue today.
To concatenate multiple data frames, you should not use append()
with a for-loop, because it can be computationally expensive and slow. You should create a list of data frames and then use concat()
once on the list. For example:
# Create an empty list of data frames
df_list = []
# Loop over the files and append each data frame to the list
for file in file_folder:
    df_list.append(pd.read_csv(file))
# Concatenate the list of data frames into one data frame
df_all = pd.concat(df_list)
This way, you avoid creating intermediate copies of the data frames and reduce the memory usage and execution time.
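As a minimal, self-contained sketch of why this matters (the frames below are invented stand-ins for the result of `pd.read_csv(file)`):

```python
import pandas as pd

# Stand-ins for the frames you would get from pd.read_csv(file);
# the data here is made up for the example.
frames = [pd.DataFrame({"a": [i, i + 1], "b": [i * 10, i * 10 + 1]})
          for i in range(100)]

# Slow pattern: growing a DataFrame inside the loop copies all
# previously accumulated rows on every iteration (quadratic work).
df_slow = pd.DataFrame()
for frame in frames:
    df_slow = pd.concat([df_slow, frame])

# Fast pattern: collect in a list, concatenate once.
df_fast = pd.concat(frames)
```

Both patterns produce the same result, but the second does a single pass over the data instead of re-copying the accumulated frame on every iteration.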
| Example | Using `concat()` | Using `append()` |
|---|---|---|
| Append two data frames row-wise | `df = pd.concat([df1, df2], axis=0)` | `df = df1.append(df2)` |
| Append two data frames column-wise | `df = pd.concat([df1, df2], axis=1)` | `df = df1.join(df2)` |
| Concatenate multiple data frames row-wise with different columns | `df = pd.concat([df1, df2, df3], axis=0, join='outer')` | `df = pd.DataFrame()`, then in a loop: `df = df.append(df_i)` for each frame |
| Concatenate multiple data frames row-wise with common columns | `df = pd.concat([df1, df2, df3], axis=0, join='inner')` | no direct equivalent: `append()` has no `join` parameter |
| Concatenate multiple data frames with a hierarchical index | `df = pd.concat([df1, df2, df3], axis=0, keys=['one', 'two', 'three'])` | no direct equivalent: you would still have to call `pd.concat` with `keys` |
As the table shows, `concat()` has a more concise syntax and is faster, so there is no reason to use `append()`.
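For instance, the hierarchical-index row of the table can be tried out with a few toy frames (the data is invented for the example):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})
df3 = pd.DataFrame({"x": [5, 6]})

# keys= adds an outer index level labelling which frame each row came from
df = pd.concat([df1, df2, df3], axis=0, keys=["one", "two", "three"])

# Select just the rows that came from df2
print(df.loc["two"])
```

The result has a two-level index, so you can still slice out the rows contributed by any one of the original frames.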
Even if you like the deprecated `append`, remember that `pd.concat` is more efficient and faster and has the same options and functionality. As Matus Dubrava mentioned, `append` calls `concat` under the hood anyway.
Upvotes: 8
Reputation: 14492
`append` is a convenience method that calls `concat` under the hood. You can see that in the implementation of the `append` method:
def append(...
    ...
    if isinstance(other, (list, tuple)):
        to_concat = [self, *other]
    else:
        to_concat = [self, other]
    return concat(
        to_concat,
        ignore_index=ignore_index,
        verify_integrity=verify_integrity,
        sort=sort,
    )
As for performance: both of these, called over and over in a loop, can be computationally expensive. You should just build a list and do one concatenation after you are done looping.
From the docs:
iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
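In practice, the pattern the docs recommend looks something like this (the row values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Collect new rows in a plain list instead of appending to df one by one
rows = []
for i in range(3):
    rows.append({"a": i, "b": i * 2})

# One concat at the end instead of one copy per appended row
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df)
```

`ignore_index=True` renumbers the combined frame so the original index values of each piece don't collide.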
Upvotes: 5
Reputation: 1113
It is an arbitrary choice to either use .append()
or .concat()
For example if you have to merge two dataframes on axis=1
then go with concat
and on the other hand if you have to merge two dataframes on axis=0
go with whatever choice you prefer to use or works best for you!
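As a quick, toy illustration of the two axes (column names and values are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

# axis=0 stacks rows; the columns are unioned and gaps become NaN
rows = pd.concat([df1, df2], axis=0)   # 4 rows, columns a and b

# axis=1 places the frames side by side, aligned on the index
cols = pd.concat([df1, df2], axis=1)   # 2 rows, columns a and b
```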
Personally, I would recommend going with the pandas default DF merger (`concat`) to avoid any delays, latencies and, most importantly, bugs in your code.
Hope you find this information useful and that it answers your question!
Happy Coding!
Upvotes: -1