Peter S

Reputation: 575

Python: Compare Filenames in Folder

Filename1: Data_A_2015-07-29_16-25-55-313.txt

Filename2: Data_B_2015-07-29_16-25-55-313.txt

I need to compare all files in the folder to make sure that for every timestamp there's ONE A file and ONE B file.

The second and millisecond parts of the filename are not always the same in both files, so what counts is the date plus %H-%M: I'm looking for 2 files for every minute.

(e.g.: Data_A_2015-07-29_16-25-55-313.txt and Data_B_2015-07-29_16-25-54-200.txt belong together)

I tried the following code:

for root, dirs, files in os.walk(source):
    for a_item in files:
        if "A" in a_item:
            A_ALL_List.append(a_item)            # One List with all A Files
            a_1 = a_item.split('_')[1]           # A Part
            a_2 = a_item.split('_')[2]           # Date Part
            a_3 = a_item.split('_')[3]           # TimeStamp %H%M%S%SS incl. .txt
            a_4 = a_3.rsplit('.txt', 1)[0]       # TimeStamp %H%M%S%SS excl. .txt
            a_5 = a_4.rsplit('-', 1)[0]          # TimeStamp %H%M%S
            a_6 = a_5.rsplit('-', 1)[0]          # TimeStamp %H%M
            a_lvl1 = a_1 + '_' + a_2 + '_' + a_3 # A_Date_FULLTimeStamp.txt
            A_Lvl1.append(a_lvl1)                # A_Date_TimeStamp.txt LIST
            a_lvl2 = a_lvl1.rsplit('.txt', 1)[0] # Split .txt
            A_Lvl2.append(a_lvl2)                # A_Date_TimeStamp LIST
            a_lvl3 = a_1 + '_' + a_2 + '_' + a_5 # A_Date_(%H%M%S)TimeStamp
            A_Lvl3.append(a_lvl3)                # A_Date_(%H%M%S)TimeStamp LIST
            a_lvl4 = a_2 + '_' + a_4             # Date_FULLTimeStamp
            A_Lvl4.append(a_lvl4)                # Date_FULLTimeStamp LIST
            a_lvl5 = a_2 + '_' + a_5             # Date_(%H%M%S)TimeStamp
            A_Lvl5.append(a_lvl5)                # Date_(%H%M%S)TimeStamp LIST
            a_lvl6 = a_2 + '_' + a_6             # Date_(%H%M)TimeStamp
            A_Lvl6.append(a_lvl6)                # Date_(%H%M)TimeStamp LIST
    for b_item in files:                         # Did the same for B now
        if "B" in b_item:
            B_All_List.append(b_item)

That way I got lists for both filenames containing only the parts I want to compare, e.g. if I compared the lists A_Lvl6 and B_Lvl6, I would compare only the date part and the hours and minutes from the timestamp.

I figured out that there are more B files than A files, so I moved on:

for Difference in B_Lvl6:            # Data in B
    if Difference not in A_Lvl6:     # Not in A
        DiffList.append(Difference)

That way I got a list (DiffList) of the timestamps for which there is a B file but no A file.

Now I would like to find the B files that correspond to the entries in DiffList (since they have no matching A files) and move those B files into another folder, so that the main folder contains only A and B files with matching timestamps (%H%M).

My question (finally):

  1. How can I manage the last part, where I want to get rid of all A or B files that have no timestamp partner?

  2. Is my method a proper way to tackle a problem like this or is it completely insane? I've been using Python for 1.5 weeks now so any suggestions on packages and tutorials would be welcome.

Solution I used:

source='/tmp'

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

def test_list(l):
    # True only for a complete pair: exactly one A and one B
    return (len(l)==2 and {t[1] for t in l}==set('AB'))

def round_time(dto, round_to=60):
    # Round to the nearest multiple of round_to seconds
    seconds = (dto - dto.min).seconds
    rounding = (seconds+round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])    # file without a partner

Upvotes: 2

Views: 5056

Answers (3)

dawg

Reputation: 103834

Given these five file names:

$ ls Data*
Data_A_2015-07-29_16-25-55-313.txt  
Data_B_2015-07-29_16-25-54-200.txt
Data_A_2015-07-29_16-26-56-314.txt  
Data_B_2015-07-29_16-26-54-201.txt
Data_A_2015-07-29_16-27-54-201.txt

You can use a regex to locate the key info: the A or B letter and the full time stamp.
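As a quick illustration of what the pattern used in this answer captures, here is a minimal sketch run against one of the file names from the question. The variable names `letter` and `stamp` are only for illustration.

```python
import re

# Same pattern as in the listings of this answer:
# group 1 is the letter (A or B), group 2 the date and full time stamp.
pat = re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

m = pat.search('Data_A_2015-07-29_16-25-55-313.txt')
letter, stamp = m.group(1), m.group(2)
print(letter)  # A
print(stamp)   # 2015-07-29_16-25-55-313

# A name that does not follow the naming scheme simply does not match.
print(pat.search('notes.txt'))  # None
```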

Since we are dealing with time stamps, the time should be rounded to the nearest time mark of interest.

Here is a function that will round up or down to the closest minute:

import datetime as dt

def round_time(dto, round_to=60):
    seconds = (dto - dto.min).seconds
    rounding = (seconds+round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)  
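To see what the rounding does, here is a small sketch that feeds the two time stamps from the question through `round_time`. The third time stamp is a made-up value, added only to show a case that rounds down.

```python
import datetime as dt

def round_time(dto, round_to=60):
    # Same function as above, repeated so this example runs on its own.
    seconds = (dto - dto.min).seconds
    rounding = (seconds + round_to / 2) // round_to * round_to
    return dto + dt.timedelta(0, rounding - seconds, -dto.microsecond)

fmt = '%Y-%m-%d_%H-%M-%S-%f'
a = round_time(dt.datetime.strptime('2015-07-29_16-25-55-313', fmt))
b = round_time(dt.datetime.strptime('2015-07-29_16-25-54-200', fmt))
c = round_time(dt.datetime.strptime('2015-07-29_16-25-29-999', fmt))  # made-up value

print(a)  # 2015-07-29 16:26:00
print(b)  # 2015-07-29 16:26:00
print(c)  # 2015-07-29 16:25:00
```

Note that rounding moves the bucket boundary to the half minute rather than removing it: two files stamped 16:25:29 and 16:25:31 would still end up under different keys.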

Combine that with a loop through the files and you can build a dictionary of lists whose keys are the time stamps rounded to the minute.

(I suspect that your files are all in the same directory, so I am showing this with os.listdir instead of os.walk since os.walk recursively goes through multiple directories)

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:   
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

print(fnames)

Prints:

{'2015-07-29 16:28:00': [('/tmp/Data_A_2015-07-29_16-27-54-201.txt', 'A')], '2015-07-29 16:27:00': [('/tmp/Data_A_2015-07-29_16-26-56-314.txt', 'A'), ('/tmp/Data_B_2015-07-29_16-26-54-201.txt', 'B')], '2015-07-29 16:26:00': [('/tmp/Data_A_2015-07-29_16-25-55-313.txt', 'A'), ('/tmp/Data_B_2015-07-29_16-25-54-200.txt', 'B')]}

Of the five files, one does not have a pair. You can filter for all the file lists that are not of length two or that do not contain an A and B pair.

First, define a test function that will test for that:

def test_list(l):
    return (len(l)==2 and {t[1] for t in l}==set('AB'))   

Then use a list comprehension to find all the entries from the dict that do not meet your conditions:

>>> [(k, v) for k, v in fnames.items() if not test_list(v)]
[('2015-07-29 16:28:00', [('/tmp/Data_A_2015-07-29_16-27-54-201.txt', 'A')])]

Then act on those files:

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])  # fn is a (path, letter) tuple; could be os.remove(fn[0])

The same basic method works with os.walk but you may have files in multiple directories.
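A sketch of the `os.walk` variant might look like the following. It assumes that files should only be paired with files from the same directory, so the key is a `(directory, rounded time)` tuple; that choice and the function name `group_files_recursive` are assumptions, not something stated in the question.

```python
import datetime as dt
import os
import re

pat = re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

def round_time(dto, round_to=60):
    # Same rounding function as above.
    seconds = (dto - dto.min).seconds
    rounding = (seconds + round_to / 2) // round_to * round_to
    return dto + dt.timedelta(0, rounding - seconds, -dto.microsecond)

def group_files_recursive(source):
    """Group matching files under source by (directory, rounded minute)."""
    fnames = {}
    for dirpath, dirnames, filenames in os.walk(source):
        for fn in filenames:
            m = pat.search(fn)
            if not m:
                continue  # not one of the Data_A / Data_B files
            stamp = dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f')
            key = (dirpath, str(round_time(stamp)))
            fnames.setdefault(key, []).append((os.path.join(dirpath, fn), m.group(1)))
    return fnames
```

If pairs are allowed to span directories, dropping `dirpath` from the key gives the same grouping as the `os.listdir` version.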

Here is the complete listing:

source='/tmp'

import os
import re
import datetime as dt

pat=re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d+-\d+-\d+-\d+)')

def test_list(l):
    return (len(l)==2 and {t[1] for t in l}==set('AB'))   

def round_time(dto, round_to=60):
    seconds = (dto - dto.min).seconds
    rounding = (seconds+round_to/2) // round_to * round_to
    return dto + dt.timedelta(0,rounding-seconds,-dto.microsecond)             

fnames={}
for fn in os.listdir(source):
    p=os.path.join(source, fn)
    if os.path.isfile(p):
        m=pat.search(fn)
        if m:   
            d=round_time(dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f'), round_to=60)
            fnames.setdefault(str(d), []).append((p, m.group(1)))

for k, v in [(k, v) for k, v in fnames.items() if not test_list(v)]:
    for fn in v:
        print(fn[0])    # This is the file that does NOT have a pair -- delete?

Upvotes: 1

pasztorpisti

Reputation: 3726

I think simply ignoring the second and millisecond parts isn't a good idea. One of your files might have 01:01:59:999 and another 01:02:00:000: the difference is only one millisecond, but it changes the minute part too. A better solution would be to parse the datetimes and calculate the timedelta between them. But let's go with the simple, stupid version. Something like this could do the job; tailor it to your needs if it isn't exactly what you need:

import os
import re

pattern = re.compile(r'^Data_(?P<filetype>A|B)_(?P<datetime>\d\d\d\d\-\d\d\-\d\d_\d\d\-\d\d)\-\d\d\-\d\d\d\.txt$')

def diff_dir(dir, files):
    a_set, b_set = {}, {}
    sets  = {'A': a_set, 'B': b_set}
    for file in files:
        path = os.path.join(dir, file)
        match = pattern.match(file)
        if match:
            sets[match.group('filetype')][match.group('datetime')] = path
        else:
            print("Filename doesn't match our pattern: " + path)
    a_datetime_set, b_datetime_set = set(a_set.keys()), set(b_set.keys())
    a_only_datetimes = a_datetime_set - b_datetime_set
    b_only_datetimes = b_datetime_set - a_datetime_set
    for dt in a_only_datetimes:
        print(a_set[dt])
    for dt in b_only_datetimes:
        print(b_set[dt])

def diff_dir_recursively(rootdir):
    for dir, subdirs, files in os.walk(rootdir):
        diff_dir(dir, files)

if __name__ == '__main__':
    # use your root directory here
    rootdir = os.path.join(os.path.dirname(__file__), 'dir')
    diff_dir_recursively(rootdir)
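The timedelta-based comparison mentioned at the top of this answer could be sketched roughly as follows. This is only one possible approach: the function names, the greedy nearest-match strategy and the 5 second tolerance are all assumptions to be adjusted to the real data.

```python
import datetime as dt
import re

stamp_pat = re.compile(r'^Data_(A|B)_(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}-\d{3})\.txt$')

def parse_stamp(filename):
    """Return (letter, datetime) for a matching filename, else None."""
    m = stamp_pat.match(filename)
    if not m:
        return None
    return m.group(1), dt.datetime.strptime(m.group(2), '%Y-%m-%d_%H-%M-%S-%f')

def pair_by_timedelta(filenames, tolerance=dt.timedelta(seconds=5)):
    """Greedily pair each A file with the closest unused B file within tolerance."""
    a_files, b_files = [], []
    for name in filenames:
        parsed = parse_stamp(name)
        if parsed:
            (a_files if parsed[0] == 'A' else b_files).append((parsed[1], name))
    a_files.sort()
    b_files.sort()
    pairs, unpaired = [], []
    for a_time, a_name in a_files:
        candidates = [b for b in b_files if abs(b[0] - a_time) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda b: abs(b[0] - a_time))
            b_files.remove(best)  # each B file can be used only once
            pairs.append((a_name, best[1]))
        else:
            unpaired.append(a_name)
    unpaired.extend(name for _, name in b_files)  # B files nobody claimed
    return pairs, unpaired
```

With this, the 01:01:59:999 and 01:02:00:000 files from the example above end up as a pair, because only the distance between the two time stamps matters and not which minute they fall in.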

Upvotes: 2

CivFan

Reputation: 15102

I wanted to post a partial answer to point out how you can assign the results of the split to names, and give them meaningful names. This usually makes solving the problem a little easier.

def match_files(files):
    result = {}

    for filename in files:
        data, letter, date, time_txt = filename.split('_')
        time, ext = time_txt.split('.')
        hour, minute, sec, ms = time.split('-')

        key = date + '_' + hour + '-' + minute

        # Initialize the inner dictionary if it doesn't already exist.
        if key not in result:
            result[key] = {}

        result[key][letter] = filename


    return result



filename1 = 'Data_A_2015-07-29_16-25-55-313.txt'
filename2 = 'Data_B_2015-07-29_16-25-55-313.txt'

file_list = [filename1, filename2]


match_files(file_list)

Output:

In [135]: match_files(file_list)
Out[135]: 
{'2015-07-29_16-25': {'A': 'Data_A_2015-07-29_16-25-55-313.txt',
  'B': 'Data_B_2015-07-29_16-25-55-313.txt'}}
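With the result in that shape, the groups missing a partner can be picked out by checking which letters are present. A minimal sketch, where `unpaired_files` is a hypothetical helper and the sample dictionary is built by hand from the file names in the question:

```python
def unpaired_files(result):
    """Return filenames from groups that lack either an 'A' or a 'B' entry."""
    leftovers = []
    for key in sorted(result):
        group = result[key]
        if not ('A' in group and 'B' in group):
            leftovers.extend(group[letter] for letter in sorted(group))
    return leftovers

result = {
    '2015-07-29_16-25': {'A': 'Data_A_2015-07-29_16-25-55-313.txt',
                         'B': 'Data_B_2015-07-29_16-25-54-200.txt'},
    '2015-07-29_16-27': {'A': 'Data_A_2015-07-29_16-27-54-201.txt'},
}
print(unpaired_files(result))  # ['Data_A_2015-07-29_16-27-54-201.txt']
```

One thing to keep in mind: because each letter maps to a single filename, a second A file in the same minute would overwrite the first one in the dictionary.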

Upvotes: 1
