Reputation: 1024
I know how to merge CSVs by converting them into dataframes, combining the dfs, and outputting the .csv. I also know how to use glob to grab all of the CSVs in a directory. (See below.)
But I don't know how to traverse the dir to get the CSVs and load the dfs into data_frames = [df1, df2, df3]. I suppose I need to somehow get each df out of the for loop and into the data_frames list on each iteration.
import csv
import datetime
import time
import pandas as pd
import glob
from functools import reduce
path = "/home/reallymemorable/Documents/scripts/*dateFormatted.csv"
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
data_frames = [df1, df2, df3]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'], how='outer'), data_frames).fillna('NULL')
df_merged.to_csv('merged.csv', sep=',', na_rep='.', index=False)
for fname in glob.glob(path):
    print(fname)
EDIT:
Using a suggestion below, I implemented this:
import os  # needed for os.listdir and os.path.join

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('dateFormatted.csv')), ignore_index=True).reset_index()
print(df)
df.to_csv('TEST.csv', index=False)
But I am confused, because the print(df) is stacking the dataframes rather than joining them:
reallymemorable@gort:~/Documents/scripts$ python3 3-mai_ccxt_concatCrypto.py
index Timestamp BTC/USDT_Open BTC/USDT_High BTC/USDT_Low ... ETH/USDT_Open ETH/USDT_High ETH/USDT_Low ETH/USDT_Close ETH/USDT_Volume
0 0 2019-01-30 3410.04 3478.00 3387.10 ... NaN NaN NaN NaN NaN
1 1 2019-01-31 3457.50 3489.20 3418.80 ... NaN NaN NaN NaN NaN
2 2 2019-02-01 3434.10 3488.00 3401.20 ... NaN NaN NaN NaN NaN
3 3 2019-02-02 3462.20 3526.40 3440.29 ... NaN NaN NaN NaN NaN
4 4 2019-02-03 3504.06 3511.09 3426.00 ... NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ...
995 995 2020-06-08 NaN NaN NaN ... 244.57 247.70 240.59 246.40 383116.54719
996 996 2020-06-09 NaN NaN NaN ... 246.37 249.82 238.00 243.80 420327.17133
997 997 2020-06-10 NaN NaN NaN ... 243.79 250.28 242.00 247.78 431285.61715
998 998 2020-06-11 NaN NaN NaN ... 247.78 250.09 226.20 230.51 816456.89161
999 999 2020-06-12 NaN NaN NaN ... 230.46 239.38 228.19 236.76 415920.13123
[1000 rows x 12 columns]
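(Note on what's happening here: pd.concat stacks frames along the row axis by default, which is why the BTC and ETH columns land on disjoint rows. To line the files up by date instead, the frames have to be merged on the shared date column, as in the reduce pattern from the first snippet. A minimal sketch, assuming each CSV has a 'Timestamp' column as in the printout above:)

import glob
from functools import reduce
import pandas as pd

# Merge (don't stack) on the shared date column.
files = glob.glob("/home/reallymemorable/Documents/scripts/*dateFormatted.csv")
frames = [pd.read_csv(f) for f in files]
df = reduce(lambda left, right: pd.merge(left, right, on='Timestamp', how='outer'), frames)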
Upvotes: 0
Views: 1405
Reputation: 4367
I prefer using os, but you can do the same with glob if you prefer.
Note: here I'm using path as the folder where the CSV files are located.
import os
import pandas as pd

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('.csv')), ignore_index=True).reset_index()
EDIT: If there are other (non-CSV) files present in the folder, you may want to add an if-statement inside pd.concat, as I've done in the code above.
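For completeness, a glob-based version of the same idea is sketched below; glob does the pattern filtering itself, so the endswith() check goes away (the *.csv pattern is an assumption about the folder contents):

import glob
import pandas as pd

# glob filters by pattern, so no endswith() check is needed
files = glob.glob("/home/reallymemorable/Documents/scripts/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True).reset_index()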
Upvotes: 1
Reputation: 189948
If I am guessing correctly what you are asking, here's a refactoring which collects all the dataframes in the directory so they can be merged.
# Removed unused imports
import pandas as pd
import glob

data_frames = []
for fname in glob.glob("/home/reallymemorable/Documents/scripts/*dateFormatted.csv"):
    # Read each matched file rather than the same hardcoded 'file1.csv' every time
    data_frames.append(pd.read_table(fname, sep=','))
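From there, data_frames can be fed straight into the reduce/merge step from the question (unchanged from the original snippet, so it still assumes a shared DATE column):

from functools import reduce

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'), data_frames).fillna('NULL')
df_merged.to_csv('merged.csv', sep=',', na_rep='.', index=False)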
Upvotes: 2