reallymemorable

Reputation: 1024

Loop over all CSVs in a directory and merge the data using pandas

I know how to merge CSVs by converting them into dataframes, combining the dfs, and outputting the .csv. I also know how to use glob to grab all of the CSVs in a directory. (See below.)

But I don't know how to traverse the directory to get the CSVs and load the resulting dataframes into data_frames = [df1, df2, df3]. I suppose I need to somehow get each df out of the for loop and into the data_frames list on each iteration.

import csv
import datetime
import time
import pandas as pd
import glob
from functools import reduce


path = "/home/reallymemorable/Documents/scripts/*dateFormatted.csv"

df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')

data_frames = [df1, df2, df3]

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'), data_frames).fillna('NULL')

df_merged.to_csv('merged.csv', sep=',', na_rep='.', index=False)


for fname in glob.glob(path):
    print(fname)

EDIT:

Using a suggestion below, I implemented this:

import os

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('dateFormatted.csv')), ignore_index=True).reset_index()
print(df)
df.to_csv('TEST.csv', index=False)

But I am confused, because the print(df) is stacking the dataframes rather than joining them:

reallymemorable@gort:~/Documents/scripts$ python3 3-mai_ccxt_concatCrypto.py 
     index   Timestamp  BTC/USDT_Open  BTC/USDT_High  BTC/USDT_Low  ...  ETH/USDT_Open  ETH/USDT_High  ETH/USDT_Low  ETH/USDT_Close  ETH/USDT_Volume
0        0  2019-01-30        3410.04        3478.00       3387.10  ...            NaN            NaN           NaN             NaN              NaN
1        1  2019-01-31        3457.50        3489.20       3418.80  ...            NaN            NaN           NaN             NaN              NaN
2        2  2019-02-01        3434.10        3488.00       3401.20  ...            NaN            NaN           NaN             NaN              NaN
3        3  2019-02-02        3462.20        3526.40       3440.29  ...            NaN            NaN           NaN             NaN              NaN
4        4  2019-02-03        3504.06        3511.09       3426.00  ...            NaN            NaN           NaN             NaN              NaN
..     ...         ...            ...            ...           ...  ...            ...            ...           ...             ...              ...
995    995  2020-06-08            NaN            NaN           NaN  ...         244.57         247.70        240.59          246.40     383116.54719
996    996  2020-06-09            NaN            NaN           NaN  ...         246.37         249.82        238.00          243.80     420327.17133
997    997  2020-06-10            NaN            NaN           NaN  ...         243.79         250.28        242.00          247.78     431285.61715
998    998  2020-06-11            NaN            NaN           NaN  ...         247.78         250.09        226.20          230.51     816456.89161
999    999  2020-06-12            NaN            NaN           NaN  ...         230.46         239.38        228.19          236.76     415920.13123

[1000 rows x 12 columns]
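(For context on the confusion above: pd.concat with its default axis=0 stacks frames row-wise, NaN-padding columns the other frame lacks; to line the BTC and ETH columns up on the shared date you have to merge on that column instead, as in the reduce line earlier in the question. A minimal sketch with stand-in data; the 'Timestamp' column name and values are illustrative, taken from the printed output above:)

```python
from functools import reduce

import pandas as pd

# two small frames standing in for the per-symbol CSVs (illustrative data)
btc = pd.DataFrame({'Timestamp': ['2019-01-30', '2019-01-31'],
                    'BTC/USDT_Close': [3410.04, 3457.50]})
eth = pd.DataFrame({'Timestamp': ['2019-01-30', '2019-01-31'],
                    'ETH/USDT_Close': [107.00, 106.50]})

# concat stacks: 4 rows, with NaN where a frame lacks the other's columns
stacked = pd.concat([btc, eth], ignore_index=True)

# merge joins on the date: 2 rows, both close columns filled
merged = reduce(lambda l, r: pd.merge(l, r, on='Timestamp', how='outer'),
                [btc, eth])
```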

Upvotes: 0

Views: 1405

Answers (2)

NotAName

Reputation: 4367

I prefer using os, but you can do all the same with glob if you prefer:

Note: here I'm using path as a folder where CSV files are located.

import os

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('csv')), ignore_index=True).reset_index()

EDIT: If there are other (non-csv) files present in the folder you may want to add an if-statement into pd.concat as I've done in the code above.
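(The glob variant mentioned above but not shown would look roughly like this; the temp folder and file names are invented here just so the sketch is self-contained and runnable. Note that with glob the pattern does the filtering, so the if-statement inside pd.concat becomes unnecessary:)

```python
import glob
import os
import tempfile

import pandas as pd

# build a throwaway folder with two small CSVs so the snippet can run anywhere
path = tempfile.mkdtemp()
for name in ("one.csv", "two.csv"):
    pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(path, name), index=False)

# the *.csv pattern already excludes non-CSV files
files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```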

Upvotes: 1

tripleee

Reputation: 189948

If I'm guessing correctly what you are asking, here's a refactoring which collects all the dataframes in the directory into a list.

# Removed unused imports
import pandas as pd
import glob


data_frames = []

for fname in glob.glob("/home/reallymemorable/Documents/scripts/*dateFormatted.csv"):
    data_frames.append(pd.read_table(fname, sep=','))
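(With the list built this way, the reduce-merge from the question then applies unchanged. A self-contained sketch, with small stand-in frames in place of the CSVs read in the loop; column values are illustrative:)

```python
from functools import reduce

import pandas as pd

# stand-in frames in place of the CSVs collected by the loop
data_frames = [
    pd.DataFrame({'DATE': ['2019-01-30'], 'open': [3410.04]}),
    pd.DataFrame({'DATE': ['2019-01-30'], 'close': [3478.00]}),
]

# same reduce-merge as in the question: outer join on the DATE column
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'),
                   data_frames)
```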

Upvotes: 2
