TangerCity

Reputation: 845

How to process multiple files based on the date in their names

Let's assume I have a structure like this:

Folder1
       `XX_20201212.txt`
Folder2
       `XX_20201212.txt`
Folder3
       `XX_20201212.txt`

My current script collects the 3 files (one from each folder), processes them and combines them into 1 file. So right now my script does the job for 1 date.

Now let's assume the structure has changed to this:

Folder1
       `XX_20201201.txt`
       `XX_20201202.txt`
Folder2
       `YY_20201201.txt`
       `YY_20201202.txt`
Folder3
       `ZZ_20201201.txt`
       `ZZ_20201202.txt`
       `ZZ_20201203.txt`

I want my script to do the same, but now for multiple dates. It should check whether a file has a date in its name that is also present in a list named missing_dates, and whether a file for that date is available in each directory. If so, I want to collect those files and process them into 1 file. So if we assume 20201201, 20201202 and 20201203 are in missing_dates, the following needs to happen:

  1. The script will process XX_20201201.txt, YY_20201201.txt and ZZ_20201201.txt into 1 file, because that date is present in missing_dates AND it's present in every directory.
  2. The script will process XX_20201202.txt, YY_20201202.txt and ZZ_20201202.txt into 1 file, because that date is present in missing_dates AND it's present in every directory.
  3. The script will NOT process ZZ_20201203.txt, because that date is not present in every directory, even though it's present in missing_dates.

In short: 3 files with the same date (in 3 different directories), where that date is present in missing_dates = process.

Please note that the code below, which processes the files into 1 file, already works. The underlying problem is that I have to adjust my loop so that it handles more than 1 date, and I don't know how to do that.

This is the code that reads the files:

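import csv
import os
import re

# counter_part, filter_row, location_token and vehicle_loc_dict are defined elsewhere in my script
for root, dirs, files in os.walk(counter_part):
    for file in files:
        # date part of the name, e.g. '20201201' in 'XX_20201201.txt'
        date_files = re.search(r'_(\d+)\.', file).group(1)
        file_path = os.path.join(root, file)
        with open(file_path, 'r', newline='') as my_file:
            reader = csv.reader(my_file, delimiter=',')
            next(reader)  # skip the header row
            for row in reader:
                if filter_row(row):
                    vehicle_loc_dict[(row[9], location_token(row))].append(row)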

Upvotes: 1

Views: 334

Answers (1)

dawg

Reputation: 103754

With the tools in pathlib this is fairly easy.

Given:

% tree /tmp/test
/tmp/test
├── dir_1
│   ├── XX_20201201.txt
│   └── XX_20201202.txt
├── dir_2
│   ├── YY_20201201.txt
│   └── YY_20201202.txt
└── dir_3
    ├── ZZ_20201201.txt
    ├── ZZ_20201202.txt
    └── ZZ_20201203.txt

3 directories, 7 files

You can do:

from pathlib import Path

root=Path('/tmp/test')

missing_dates=['20201201']

for fn in (e for e in root.glob('**/*.txt') 
    if e.is_file() and any(d in str(e) for d in missing_dates)):
    print(fn)
    # here do what you mean by 'proceed' with path fn

Prints:

/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
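
Note that `d in str(e)` is a plain substring test on the whole path, so a date that happens to appear in a directory name would match as well. If that matters, a stricter check could look roughly like this. It is only a sketch: it assumes every file name ends in `_YYYYMMDD.txt`, and `date_from_name` is a hypothetical helper, not something pathlib provides:

```python
import re
from pathlib import Path

# Assumption: file names look like <prefix>_<8 digits>.txt
DATE_RE = re.compile(r"_(\d{8})$")

def date_from_name(path):
    """Return the 8-digit date at the end of the file stem, or None."""
    m = DATE_RE.search(Path(path).stem)
    return m.group(1) if m else None

missing_dates = {"20201201"}

# Example paths only; nothing is read from disk here
candidates = [
    Path("/tmp/test/dir_1/XX_20201201.txt"),
    Path("/tmp/test/dir_3/ZZ_20201203.txt"),
    Path("/tmp/20201201/notes.txt"),  # date in the directory name, not the file name
]

selected = [p for p in candidates if date_from_name(p) in missing_dates]
print([p.name for p in selected])
```

Only the first path is selected, because the date is taken from the file name itself rather than from anywhere in the path.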

Or, you could do:

missing_dates=['20201201', '20201202']

for d in missing_dates:
    print(f"processing {d}")
    for fn in (e for e in root.glob(f"**/*_{d}.txt") if e.is_file()):
        print(fn)
        # here do what you mean by 'proceed'

Prints:

processing 20201201
/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
processing 20201202
/tmp/test/dir_2/YY_20201202.txt
/tmp/test/dir_3/ZZ_20201202.txt
/tmp/test/dir_1/XX_20201202.txt

If you are only interested in groups of 3, you can do:

missing_dates=['20201201', '20201202', '20201203']

for d in missing_dates:
    print(f"processing {d}")
    files=[e for e in root.glob(f"**/*_{d}.txt") if e.is_file()]
    if len(files)==3:
        print(files)

Prints:

processing 20201201
[PosixPath('/tmp/test/dir_2/YY_20201201.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201201.txt'), PosixPath('/tmp/test/dir_1/XX_20201201.txt')]
processing 20201202
[PosixPath('/tmp/test/dir_2/YY_20201202.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201202.txt'), PosixPath('/tmp/test/dir_1/XX_20201202.txt')]
processing 20201203
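
The `len(files)==3` test hard-codes the number of folders. If you would rather check "present in every directory" directly, one possible sketch is below. It assumes each immediate subdirectory of the root is one source folder and that names end in `_YYYYMMDD.txt`; `complete_groups` is a hypothetical helper name:

```python
import re
from collections import defaultdict
from pathlib import Path

def complete_groups(root, missing_dates):
    """Map each wanted date to its files, keeping only dates found in every subdirectory."""
    root = Path(root)
    # Assumption: each immediate subdirectory of root is one source folder
    folders = {p for p in root.iterdir() if p.is_dir()}
    by_date = defaultdict(dict)  # date -> {folder: file}
    for folder in folders:
        for fn in folder.glob("*.txt"):
            m = re.search(r"_(\d{8})$", fn.stem)
            if m and m.group(1) in missing_dates:
                # If a folder holds two files for one date, the last one seen wins
                by_date[m.group(1)][folder] = fn
    # Keep a date only when every folder contributed a file for it
    return {d: sorted(found.values())
            for d, found in sorted(by_date.items())
            if found.keys() == folders}
```

Calling `complete_groups('/tmp/test', missing_dates)` on the tree above should return entries for 20201201 and 20201202 only, each with one file per directory, which you can then hand to whatever does the actual processing.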

You can do the same thing with os.walk and glob.glob, but it takes more work.
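
For comparison, an `os.walk` version of the date grouping might look something like this. Again, this is a sketch under the assumption that names end in `_YYYYMMDD.txt`; `files_by_date` is a hypothetical name:

```python
import os
import re
from collections import defaultdict

def files_by_date(top, missing_dates):
    """Walk `top` and collect paths of .txt files whose name carries a wanted date."""
    found = defaultdict(list)  # date -> list of full paths
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            # Assumption: names look like <prefix>_<8 digits>.txt
            m = re.fullmatch(r".*_(\d{8})\.txt", name)
            if m and m.group(1) in missing_dates:
                found[m.group(1)].append(os.path.join(dirpath, name))
    return dict(found)
```

You would still have to filter the result yourself, for example by keeping only the dates whose list has one entry per folder.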

Upvotes: 1
