trench
trench

Reputation: 5355

Find the latest file for each calendar month in a folder

The code below works as I need it to, but I feel like there must be a better way. I have a folder with daily(ish) files inside of it. All of them have the same prefix and the date they were sent as the file name. On certain days, no file was sent at all though. My task it to read the last file of each month (most of the time it is the last day, but April's last file was the 28th, July was the 29th, etc).

This is using the pathlib module, which I like to continue to use.

files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files] #replace everything but a list of the date elements
dates = [pd.to_datetime(date[0] + '-' + date[1] + '-' + date[2]) for date in file_dates] #construct the proper date format
x = pd.DataFrame(dates)
x['month'] = x[0].dt.strftime('%Y-%m') + '-01'
max_value = x.groupby(['month'])[0].max().reset_index()
max_value[0] = max_value[0].dt.strftime('%Y_%m_%d')
monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in max_value[0].values]

df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])

I believe this is a case where, because I have a hammer (pandas), everything looks like a nail (I turn everything into a dataframe). I am also trying to get used to list comprehensions after several years of not using them.

Upvotes: 0

Views: 795

Answers (3)

Philip B.
Philip B.

Reputation: 637

To my knowledge this is going to be difficult to do with list comprehension since you have to compare the current element with the next element.

However there are simpler solutions that will get you there without pandas.

The example below just loops over a string list with the file dates and keeps the date before the month changes. Since your list is sorted that should do the trick. I am assuming YYYY_MM_DD date formats

files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '') for file in files] 

#adding a dummy date because we're comparing to the next element
file_dates.append('0000_00_00')
result = []
for i, j in enumerate(file_dates[:-1]):
    if j[6:7] != file_dates[i+1][6:7]: 
        result.append(j)

monthly_files = [str(ROOT / 'prefix_') + date + '.csv.xz' for date in result]

df = pd.concat([pd.read_csv(file, usecols=columns, sep='\t', compression='xz', dtype=object) for file in monthly_files])

Upvotes: 1

Bharadwaj Yarlagadda
Bharadwaj Yarlagadda

Reputation: 61

So the file names would be prefix_<date> and the date is in format %Y-%m-%d.

import os
from datetime import datetime as dt
from collections import defaultdict
from pathlib import Path

group_by_month = defaultdict(list)
files = []

# Assuming the folder is the data folder path itself.
for file in Path(folder).iterdir():
    if os.path.isfile(file) and file.startswith('prefix_'):
        # Convert the string date to a datetime object
        converted_dt = dt.strptime(str(file).split('prefix_')[1], 
                                   '%Y-%m-%d')

        # Group the dates by month
        group_by_month[converted_dt.month].append(converted_dt)

# Get the max of all the dates stored.
max_dates = {month: max(group_by_month[month]) 
             for month in group_by_month.keys()}

# Get the files that match the prefix and the max dates
for file in Path(folder).iterdir():
    for date in max_date.values():
        if ('prefix_' + dt.strftime(date, '%Y-%m-%d')) in str(file):
            files.append(file)

PS: I haven't worked with pandas a lot. So, went with the native style to get the files that match the max date of a month.

Upvotes: 1

Yorian
Yorian

Reputation: 2062

There's probably better, but here's my try:

files = sorted(ROOT.glob('**/*.csv*'))
file_dates = [Path(file.stem).stem.replace('prefix_', '').split('_') for file in files] #replace everything but a list of the date elements

df = pd.DataFrame(file_dates, columns=['y', 'm', 'd'], dtype='int')
monthly = [str(yy)+'-'+str(mm)+'-'+str(df.loc[(df['y'] == yy) & (df['m'] == mm), 'd'].max()) for yy in df.y.unique() for mm in df.m.unique()]

Upvotes: 2

Related Questions