Atihska

Reputation: 5126

Find missing filenames in a sequence of numbers stored in a list

I have a list of strings holding timestamp-based filenames (date_time.csv, where the time part begins with the two-digit hour) like these:

    [..., file_20181105_110001.csv, file_20181105_120002.csv, file_20181105_130002.csv,
     file_20181105_140002.csv, file_20181105_150003.csv, file_20181105_160002.csv,
     file_20181105_170002.csv, file_20181105_200002.csv, file_20181105_210002.csv,
     file_20181106_010002.csv, file_20181106_020002.csv, file_20181106_030002.csv, ...]

So here there are files dated 2018-11-05 (Nov 5, 2018) for hours 11, 12, 13, 14, 15, 16, 17, 20 and 21.

I want to print only the filenames for hours 18 and 19, as they are missing. The valid hour range is 1 to 23, so if an hour in this range has no filename for a given day (here it's 2018-11-05), I want to print the files for those missing hours.

Upvotes: 2

Views: 779

Answers (2)

wendelbsilva

Reputation: 772

Another solution, in case you also need to check for files missing at the beginning or end of the list (e.g. hours 0-10, 22 and 23):

# Assumes filenames holds a single day's files, sorted by hour.
filenames = ['file_20181105_110001.csv', 'file_20181105_120002.csv', 'file_20181105_150003.csv']
pos = 0  # index of the next filename still to be matched
for h in range(0, 24):  # hours 0 through 23
    n = "file_20181105_" + str(h).zfill(2)
    if pos < len(filenames) and filenames[pos].startswith(n):
        print("Found", h)
        pos += 1
    else:
        print("Not found", h)

Of course, you can build n for the day you want to check in several different ways. If needed, you can add another loop to go through the days.

Edit:

If we want to check more than one day, we can loop through the days, checking each day's files and hours.

IMHO, I would suggest many changes to the following code depending on the use case, number of days, number of filenames, preference, code style, etc.

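# Assumes filenames is sorted by day, then by hour.
filenames = ['file_20181104_110001.csv', 'file_20181105_120002.csv', 'file_20181105_150003.csv']
pos = 0  # index of the next filename still to be matched
missing = []  # (day, hour) pairs with no matching file
for d in (4, 5):
    for h in range(0, 24):  # hours 0 through 23
        n = "file_201811" + str(d).zfill(2) + "_" + str(h).zfill(2)
        if pos < len(filenames) and filenames[pos].startswith(n):
            pos += 1
            print("Found", d, h)
        else:
            missing.append((d, h))
            print("Not Found", d, h)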
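
As one possible variation (a rough sketch, not the only way to do it): if the list is not guaranteed to be sorted, the hours can be grouped per day into sets first, and the missing hours computed per day. The function name `missing_hours_by_day` and its `first_hour`/`last_hour` parameters are made up for illustration; the default range of 1 to 23 is taken from the question, and the sketch assumes every name has the form file_date_time.csv.

```python
from collections import defaultdict


def missing_hours_by_day(filenames, first_hour=1, last_hour=23):
    """Map each date string found in filenames to its sorted missing hours.

    Hypothetical helper: assumes names look like 'file_20181105_110001.csv'.
    """
    hours_seen = defaultdict(set)
    for name in filenames:
        # 'file_20181105_110001.csv' -> ['file', '20181105', '110001.csv']
        _, date, time_part = name.split('_')
        hours_seen[date].add(int(time_part[:2]))

    expected = set(range(first_hour, last_hour + 1))
    return {date: sorted(expected - seen) for date, seen in hours_seen.items()}


filenames = ['file_20181104_110001.csv', 'file_20181105_120002.csv', 'file_20181105_150003.csv']
result = missing_hours_by_day(filenames)
for date, hours in result.items():
    print(date, hours)
```

Note that a day with no files at all never appears in the result, since the days are discovered from the filenames themselves.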

Upvotes: 0

jpp

Reputation: 164703

One solution is to use a set comprehension to extract the times present. If I understand your requirement, you can then calculate the min and max times and take the difference from a set derived from a range:

L = ['file_20181105_110001.csv', 'file_20181105_120002.csv', 'file_20181105_130002.csv',
     'file_20181105_140002.csv', 'file_20181105_150003.csv', 'file_20181105_160002.csv',
     'file_20181105_170002.csv', 'file_20181105_200002.csv', 'file_20181105_210002.csv']

present = {int(i.rsplit('_', 1)[-1][:2]) for i in L}

min_time, max_time = min(present), max(present)

res = set(range(min_time, max_time)) - present  # {18, 19}

You can then build your filenames from the missing times. I'll leave this as an exercise [hint: list comprehension].
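
For readers who want to check their attempt, here is a minimal sketch of that last step. It assumes a single day (`20181105`, hard-coded here) and builds only the filename prefix up to the hour, since the minutes and seconds of a missing file cannot be known; the name `missing_names` is just illustrative.

```python
L = ['file_20181105_110001.csv', 'file_20181105_120002.csv', 'file_20181105_130002.csv',
     'file_20181105_140002.csv', 'file_20181105_150003.csv', 'file_20181105_160002.csv',
     'file_20181105_170002.csv', 'file_20181105_200002.csv', 'file_20181105_210002.csv']

# Hours present in the list, taken from the first two digits of the time part.
present = {int(i.rsplit('_', 1)[-1][:2]) for i in L}

# Hours between the smallest and largest present hour that have no file.
res = set(range(min(present), max(present))) - present

# Build a zero-padded prefix for each missing hour (day is assumed fixed).
missing_names = ['file_20181105_{:02d}'.format(h) for h in sorted(res)]
print(missing_names)  # ['file_20181105_18', 'file_20181105_19']
```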

Upvotes: 2
