Skruff
Skruff

Reputation: 75

Python Merge Pandas Dataframe

I am new to Python and am looking for a simple solution.

I have several .csv files with the same structure (number of columns and lines) in one folder. The path is: C:\temp

Now I want to read all these .csv files into a new dataframe, which I want to export later as a new .csv file.

up to now i have read each .csv file by hand and saved it into a pandas dataframe.

Here is an example:

df1 = pd.read_csv(r "C:\temp\df1.csv", header= None)
df2 = pd.read_csv(r "C:\temp\df2.csv", header= None)

df1

0 id Feature
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 7 17
7 8 15
8 9 12
9 10 13
10 11 23

Then I used .append to merge the dataframes.

df_new = df1.append(df2)

0   id  Feature
1   1   12
2   2   13
3   3   14
4   4   15
5   5   16
6   7   17
7   8   15
8   9   12
9   10  13
10  11  23
0   id  Feature
1   1   14
2   2   9
3   3   3
4   4   8
5   5   9
6   7   1
7   8   32
8   9   7
9   10  3
10  11  12

df_new.to_csv('df_new.csv', index=False)

Unfortunately this version always has the header with me, but I don't need it. So I deleted it afterwards by hand.

Isn't there a faster version? I'm thinking of a for loop that opens all existing .csv files in the path and reads them line by line into a new dataframe and at the end of the loop makes a .csv file out of it? Unfortunately I have no experience with loops.

I appreciate your help.

Upvotes: 0

Views: 110

Answers (2)

Erik van de Ven
Erik van de Ven

Reputation: 4975

You could do something like this:

frames = []
for i in range(0,10):
    frames.append(pd.read_csv(r "C:\temp\df%d.csv" % i))

df = pd.concat(frames)

Which should concatenate all csv files into one dataframe and which works if all filenames are structured like df1d.csv, df2d.csv etc. and you know the amount of files. If you just want to loop over all files in the folder, you could do something like this:

import glob

path = "C:\temp\*.csv"

frames = []
for fname in glob.glob(path):
    frames.append(pd.read_csv(fname))

df = pd.concat(frames)

The extended version, without a loop would be:

df1 = pd.read_csv(r "C:\temp\df1.csv")
df2 = pd.read_csv(r "C:\temp\df2.csv")
df3 = pd.read_csv(r "C:\temp\df3.csv")
frames = [df1, df2, df3]
df = pd.concat(frames)

Afterwards you could save the dataframe back to a csv file:

df.to_csv(file_name)

Upvotes: 1

thelogicalkoan
thelogicalkoan

Reputation: 620

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: df = pd.read_csv(StringIO("""0 id Feature
   ...: 1 1 12
   ...: 2 2 13
   ...: 3 3 14
   ...: 4 4 15
   ...: 5 5 16
   ...: 6 7 17
   ...: 7 8 15
   ...: 8 9 12
   ...: 9 10 13
   ...: 10 11 23"""), sep=' ')

In [4]: df1 = pd.read_csv(StringIO("""0   id  Feature
   ...: 1   1   14
   ...: 2   2   9
   ...: 3   3   3
   ...: 4   4   8
   ...: 5   5   9
   ...: 6   7   1
   ...: 7   8   32
   ...: 8   9   7
   ...: 9   10   3
   ...: 10   11   12"""), sep='   ')

In [10]: pd.concat([df, df1])
Out[10]: 
    0  id  Feature
0   1   1       12
1   2   2       13
2   3   3       14
3   4   4       15
4   5   5       16
5   6   7       17
6   7   8       15
7   8   9       12
8   9  10       13
9  10  11       23
0   1   1       14
1   2   2        9
2   3   3        3
3   4   4        8
4   5   5        9
5   6   7        1
6   7   8       32
7   8   9        7
8   9  10        3
9  10  11       12

In [11]: %timeit pd.concat([df, df1])

188 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: df.append(df1)
Out[14]: 
    0  id  Feature
0   1   1       12
1   2   2       13
2   3   3       14
3   4   4       15
4   5   5       16
5   6   7       17
6   7   8       15
7   8   9       12
8   9  10       13
9  10  11       23
0   1   1       14
1   2   2        9
2   3   3        3
3   4   4        8
4   5   5        9
5   6   7        1
6   7   8       32
7   8   9        7
8   9  10        3
9  10  11       12

In [15]: %timeit df.append(df1)
197 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With pandas version '1.1.3'

You can clearly check that pd.concat is faster than df.append(df1).

For working with loops, you can create a variable with the filenames and keep a list of dataframes from those files using for loop, something like this

filename = ['1.csv', '2.csv']

dfs = []

for name in filename:
    dfs.append(pd.read_csv(name))

new_df = pd.concat(dfs)

This is easy, efficient, cleaner and faster as well.

And then save the file to csv.

new_df.to_csv(out_filename)

Upvotes: 0

Related Questions