MysterioProgrammer91
MysterioProgrammer91

Reputation: 569

File containing filenames with dates, want to get filenames for the first available date of each month

this question is similar to a question I asked earlier: Extracting certain elements from a list of a string and turning into datetime with Pandas

However, imagine there a folder with many files (csv's)....and they are all in the format of ranging from 2004 to 2017 and the files are somewhat sparadic as sometimes they are daily frequency but can be monthly. Additionally there is not a known pattern of file name convention, however the dates are always within the file in format of yyyymmdd and all are after 2000, and there are no 20 numbers in the filenames:

Eg, X = ['director_send_20140212ds.csv', 'send20140213abs.csv', 'mike20140214.csv', ....etc]

I want to get the the filenames of only the filenames which contain the first date of the month. Eg, if the data is starting from 2000 to 2017, I only want a list of the file names which contain the first day of the month (for which their is a file available). So the list will be much smaller than the one available but only gives FILENAME of the first day of the month. If the first day of the month for which there is a file is the 10th it would be included, and if there is not a filename for a particular month, it would go to the next month.

So, eventually if there was a ideal world with daily frequency filenames we would get 12 filenames for a particular year.

Upvotes: 0

Views: 3161

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

I'd do it this way:

Source list of file names:

In [43]: X = ['director_send_20140212ds.csv',
   ....:      'send20140213abs.csv',
   ....:      'mike20140214.csv',
   ....:      'director_send_20140430ds.csv',
   ....:      'send20140402abs.csv',
   ....:      'mike20141214.csv',
   ....: ]

let's create a DF out of it:

In [44]: df = pd.DataFrame({'fn':X})

now let's parse dates out of file names:

In [45]: df['date'] = pd.to_datetime(df['fn'].str.extract(r'.*(\d{8}).*', expand=False))

now we can easily find the minimum date per month

In [47]: df.loc[df.groupby([df.date.dt.year, df.date.dt.month])['date'].idxmin(), 'fn']
Out[47]:
0    director_send_20140212ds.csv
4             send20140402abs.csv
5                mike20141214.csv
Name: fn, dtype: object

Upvotes: 1

IanS
IanS

Reputation: 16251

I believe that, rather than extracting the date from the file name, it is easier to iterate over the dates, and match corresponding file names:

# first, list all files
import os
files = pd.Series(os.listdir('path_to_folder'))
# second, iterate over dates
for date in pd.date_range('2000-01-01', '2017-01-01', freq='MS').strftime('%Y%m%d'):
    for file_name in files[files.str.contains(date)]:
        print(file_name)
        # do what you need to do

Note that since the question is tagged with pandas, I make use of pandas functionality to (a) generate the date range and (b) match the date in the file names.

Upvotes: 2

Related Questions