How to process multiple files based on the date in their names

Question

Let's assume I have a structure like this:

Folder1
       `XX_20201212.txt`
Folder2
       `XX_20201212.txt`
Folder3
       `XX_20201212.txt`

My current script collects the 3 files from the folders, processes them, and merges them into 1 file. So right now my script does the job for 1 date.

Now let's assume the structure has changed to this:

Folder1
       `XX_20201201.txt`
       `XX_20201202.txt`
Folder2
       `YY_20201201.txt`
       `YY_20201202.txt`
Folder3
       `ZZ_20201201.txt`
       `ZZ_20201202.txt`
       `ZZ_20201203.txt`

I want my script to do the same, but now for multiple dates. It should check whether a file has a date in its name that is also present in a list named missing_dates, and whether a file with that date is available in every directory. If so, I want to collect those files and process them into 1 file. So if we assume 20201201, 20201202 and 20201203 are in missing_dates, the following needs to happen:

The script will process XX_20201201.txt, YY_20201201.txt and ZZ_20201201.txt into 1 file, because that date is present in missing_dates AND it's present in every directory.
The script will process XX_20201202.txt, YY_20201202.txt and ZZ_20201202.txt into 1 file, because that date is present in missing_dates AND it's present in every directory.
The script will NOT process ZZ_20201203.txt, because that date is not present in every directory, even though it's present in missing_dates.

In short: 3 files with the same date (in 3 different directories), with a date that is present in missing_dates = proceed.

Please note that the code below, which processes the files into 1 file, is already working. The underlying problem is that I have to adjust my loop so that it handles more than 1 date, and I don't know how to do that.

This is the code that reads the files:

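import csv
import os
import re

# counter_part, filter_row, location_token and vehicle_loc_dict come from the rest of my script
for root, dirs, files in os.walk(counter_part):
    for file in files:
        # date part of the file name, e.g. '20201201' from 'XX_20201201.txt'
        date_files = re.search(r'_(\d+)\.', file).group(1)
        file_path = os.path.join(root, file)
        with open(file_path, 'r', newline='') as my_file:
            reader = csv.reader(my_file, delimiter=',')
            next(reader)  # skip the first row
            for row in reader:
                if filter_row(row):
                    vehicle_loc_dict[(row[9], location_token(row))].append(row)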

dawg · Accepted Answer

With the tools in pathlib this is fairly easy.

Given:

% tree /tmp/test
/tmp/test
├── dir_1
│   ├── XX_20201201.txt
│   └── XX_20201202.txt
├── dir_2
│   ├── YY_20201201.txt
│   └── YY_20201202.txt
└── dir_3
    ├── ZZ_20201201.txt
    ├── ZZ_20201202.txt
    └── ZZ_20201203.txt

3 directories, 7 files

You can do:

from pathlib import Path

root=Path('/tmp/test')

missing_dates=['20201201']

for fn in (e for e in root.glob('**/*.txt') 
    if e.is_file() and any(d in str(e) for d in missing_dates)):
    print(fn)
    # here do what you mean by 'proceed' with path fn

Prints:

/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
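
One caveat: `d in str(e)` tests the whole path, so a date that happens to appear in a directory name would also match. A stricter variant, sketched below, extracts the date from the file name only. The helper name `files_for_dates` and the pattern `DATE_RE` are made up for this illustration, and the pattern assumes names of the form `<prefix>_<8 digits>.txt`.

```python
import re
from pathlib import Path

# Assumed naming scheme: <prefix>_<YYYYMMDD>.txt
DATE_RE = re.compile(r"_(\d{8})\.txt$")


def files_for_dates(root, missing_dates):
    """Return the .txt files under root whose *file name* carries a wanted date."""
    wanted = set(missing_dates)
    matches = []
    for path in sorted(Path(root).glob("**/*.txt")):
        m = DATE_RE.search(path.name)  # look at the name only, not the full path
        if m and m.group(1) in wanted and path.is_file():
            matches.append(path)
    return matches
```

Calling `files_for_dates('/tmp/test', ['20201201'])` on the tree above should give the same three files as the loop, in sorted order.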

Or, you could do:

missing_dates=['20201201', '20201202']

for d in missing_dates:
    print(f"processing {d}")
    for fn in (e for e in root.glob(f"**/*_{d}.txt") if e.is_file()):
        print(fn)
        # here do what you mean by 'proceed'

Prints:

processing 20201201
/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
processing 20201202
/tmp/test/dir_2/YY_20201202.txt
/tmp/test/dir_3/ZZ_20201202.txt
/tmp/test/dir_1/XX_20201202.txt

If you are only interested in groups of 3, you can do:

missing_dates=['20201201', '20201202', '20201203']

for d in missing_dates:
    print(f"processing {d}")
    files = [e for e in root.glob(f"**/*_{d}.txt") if e.is_file()]
    if len(files) == 3:
        print(files)

Prints:

processing 20201201
[PosixPath('/tmp/test/dir_2/YY_20201201.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201201.txt'), PosixPath('/tmp/test/dir_1/XX_20201201.txt')]
processing 20201202
[PosixPath('/tmp/test/dir_2/YY_20201202.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201202.txt'), PosixPath('/tmp/test/dir_1/XX_20201202.txt')]
processing 20201203
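
The `len(files)==3` test hard-codes the number of directories. If the requirement is really "present in every directory", one possible sketch is to compare the set of parent directories of the matches against the set of subdirectories of the root. The function name `complete_dates` and the pattern `DATE_RE` are invented for this example, and it assumes the files sit directly inside the immediate subdirectories of the root, as in the tree above.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <prefix>_<YYYYMMDD>.txt
DATE_RE = re.compile(r"_(\d{8})\.txt$")


def complete_dates(root, missing_dates):
    """Map each wanted date to its files, keeping only dates found in every subdirectory."""
    root = Path(root)
    subdirs = {p for p in root.iterdir() if p.is_dir()}
    by_date = defaultdict(list)
    # "*/*.txt" only looks one level down: root/<subdir>/<file>.txt
    for path in sorted(root.glob("*/*.txt")):
        m = DATE_RE.search(path.name)
        if m and m.group(1) in missing_dates and path.is_file():
            by_date[m.group(1)].append(path)
    # Keep a date only if its files cover every subdirectory
    return {
        d: files
        for d, files in by_date.items()
        if {f.parent for f in files} == subdirs
    }
```

With the tree above and all three dates in `missing_dates`, this should return entries for 20201201 and 20201202 only, since 20201203 exists in `dir_3` alone.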

You can do the same thing with os.walk and glob.glob but it is just more work...
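
For comparison, here is a rough sketch of the `os.walk` route, which stays closer to the loop in the question. It groups file paths by the date in their names in a single pass. The names `group_by_date` and `DATE_RE` are illustrative, and the pattern again assumes names of the form `<prefix>_<8 digits>.txt`.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: <prefix>_<YYYYMMDD>.txt
DATE_RE = re.compile(r"_(\d{8})\.txt$")


def group_by_date(top, missing_dates):
    """Walk top once and return {date: [file paths]} for the wanted dates."""
    wanted = set(missing_dates)
    grouped = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            m = DATE_RE.search(name)
            if m and m.group(1) in wanted:
                grouped[m.group(1)].append(os.path.join(dirpath, name))
    return dict(grouped)
```

You would then loop over the result and only process a date when its list has the expected number of files, for example `if len(paths) == 3`.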
