Steveiepete
Steveiepete

Reputation: 91

Combine files into a single file by date

I'm working with a directory that has a large amount of files that get put in it daily. I'm wondering how I could combine all of the daily files into a single daily file. For example, a file path contains the following files:

Name    Date Modified
a.csv   01/01/2020 9:00 AM
b.csv   01/01/2020 10:00 AM
c.csv   01/01/2020 12:00 AM
d.csv   01/03/2020 8:00 AM
e.csv   01/03/2020 11:00 AM
f.csv   01/04/2020 9:00 AM

Desired output of files:

Name    
combined_01012020.csv 
combined_01032020.csv
combined_01042020.csv

Just combining every file for January 1st into one file and so on. So far I have done this, but it's not what I want:

import os
import glob
import time
from datetime import timedelta,datetime
import pandas as pd

file_path = 'C:\csv_files'
files = glob.glob(os.path.join(path,"*.csv")

new_csv = list()
for f in files:
    new_csv.append(f)

time = time.strftime("%Y%m%d")
new_csv.to_csv(r'C:\daily_csv_files\combined' + str(time)+'.csv', index=False)

All this does is combines all files I have into a single file. I've been trying to figure the dates portion of what I want, but haven't had any luck and just combine everything into one large file. Any help would be appreciated.

Upvotes: 1

Views: 929

Answers (1)

Umar.H
Umar.H

Reputation: 23099

Using pathlib and defaultdict

we can iteratively search your directory for csv and then we can create a dictionary with the file as the key and the value is the datetime, we can then reverse engineer this dict to create a single dict with datetime as the key and the file as a list of files. (im sure this could be done in one step but I'm shite with dictionaries)

Note - make sure you make a copy of your files.

set your target path to write your files.

to get the latest modified time we can take adavantage of the stat attribute which gives us several bits of information about the file. The one we are interested in is when the file was last modified.

we use f.stat().st_mtime on your Pathlib object convert it into a datetime and then format the date to match your specified formatting %m_%d_%y

from pathlib import Path
from collections import defaultdict
from datetime import datetime

my_path = 'your_dir'

files = [f for f in Path(my_path).glob('*.csv')]
# create a dictionary with file as key and time as value.
file_dict = dict(
    zip(
        files,
        [datetime.fromtimestamp(f.stat().st_mtime).strftime("%m_%d_%y") for f in files],
    )
)

#reverse engineer the list so it has a list appended by the date.

file_merger_list = defaultdict(list)

for k,v in file_dict.items():
    file_merger_list[v].append(k)

# make folder to write to.
target_path = Path(your_path).joinpath('mergedfiles').mkdir(parents=True,exists_ok=False)


for date,file in file_merger_list.items():
    df = pd.concat([pd.read_csv(f) for f in file])
    df.to_csv(target_path.joinpath(f"combined_{date}.csv"))

output

 ['combined_01_02_20.csv',
 'combined_01_28_20.csv',
 'combined_12_02_19.csv',
 'combined_12_09_19.csv']

Upvotes: 1

Related Questions