Reputation: 91
I'm working with a directory that has a large number of files added to it daily. I'm wondering how I could combine all of each day's files into a single daily file. For example, a folder contains the following files:
Name Date Modified
a.csv 01/01/2020 9:00 AM
b.csv 01/01/2020 10:00 AM
c.csv 01/01/2020 12:00 AM
d.csv 01/03/2020 8:00 AM
e.csv 01/03/2020 11:00 AM
f.csv 01/04/2020 9:00 AM
Desired output of files:
Name
combined_01012020.csv
combined_01032020.csv
combined_01042020.csv
Just combining every file for January 1st into one file and so on. So far I have done this, but it's not what I want:
import os
import glob
import time
import pandas as pd

file_path = r'C:\csv_files'  # raw string so the backslashes aren't treated as escapes
files = glob.glob(os.path.join(file_path, "*.csv"))
# read every csv and stack them into one DataFrame
combined = pd.concat([pd.read_csv(f) for f in files])
date_str = time.strftime("%Y%m%d")
combined.to_csv(r'C:\daily_csv_files\combined_' + date_str + '.csv', index=False)
All this does is combine every file I have into a single file. I've been trying to figure out the dates portion of what I want, but haven't had any luck; everything just gets combined into one large file. Any help would be appreciated.
Upvotes: 1
Views: 929
Reputation: 23099
Using pathlib and defaultdict
We can iteratively search your directory for CSVs and create a dictionary with each file as the key and its modification datetime as the value; we can then invert that dictionary into a single dict with the datetime as the key and a list of files as the value. (I'm sure this could be done in one step, but I'm not great with dictionaries.)
Note - make sure you make a copy of your files first, and set the target path you want to write the merged files to.
To get the last modified time we can take advantage of stat(), which gives us several bits of information about the file; the one we are interested in is when the file was last modified. We call f.stat().st_mtime on the Path object, convert it into a datetime, and then format the date as %m_%d_%y.
from pathlib import Path
from collections import defaultdict
from datetime import datetime

import pandas as pd

my_path = 'your_dir'
files = list(Path(my_path).glob('*.csv'))

# create a dictionary with file as key and formatted modified date as value.
file_dict = dict(
    zip(
        files,
        [datetime.fromtimestamp(f.stat().st_mtime).strftime("%m_%d_%y") for f in files],
    )
)

# invert the dict so each date maps to a list of files.
file_merger_list = defaultdict(list)
for k, v in file_dict.items():
    file_merger_list[v].append(k)

# make a folder to write to.
target_path = Path(my_path).joinpath('mergedfiles')
target_path.mkdir(parents=True, exist_ok=True)

for date, file_list in file_merger_list.items():
    df = pd.concat([pd.read_csv(f) for f in file_list])
    df.to_csv(target_path.joinpath(f"combined_{date}.csv"), index=False)
Output:
['combined_01_02_20.csv',
'combined_01_28_20.csv',
'combined_12_02_19.csv',
'combined_12_09_19.csv']
Upvotes: 1