Ted

Reputation: 499

Reading filenames from the directory

I want to get two separate lists of file names using glob, with each list containing one type of file. I have two types of data files. For example,

  1. 2018-01-02.dat
  2. 2018-01-02_patients.dat

The only difference is that the second file type has "_patients" after the date. The date can be anything, but the format is consistent. How can I accomplish this using glob?

Upvotes: 3

Views: 93

Answers (5)

Nathan Vērzemnieks

Reputation: 5603

glob isn't particularly well suited to this task, but regular expressions are. You can use os.listdir(path) to get a list of all files and re.match to check for a date, optionally followed by "_patients", and definitely followed by ".dat". Here's how I'd do that:

import re
import os

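# Raw string so "\." reaches the regex engine unchanged.
pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2}(_patients)?\.dat$'

def is_patient_file(filename):
    # True for both file types: dated .dat files with or without "_patients".
    return re.match(pattern, filename) is not None

def get_patient_files(path):
    all_files = os.listdir(path)
    # filter() is lazy in Python 3; wrap it in list() so print shows the names.
    return list(filter(is_patient_file, all_files))

print(get_patient_files('.'))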

The parts of the regular expression are:

  • the date: [0-9]{4}-[0-9]{2}-[0-9]{2}
    • that is, four digits, dash, two digits, dash, two digits.
  • maybe patients: (_patients)?
  • definitely .dat: \.dat
  • and no more: $
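The pattern above accepts both file types in one pass, so producing the two separate lists the question asks for needs one more step. A minimal sketch, assuming the optional group is used to tell the types apart; the names DAT_PATTERN and split_by_type are hypothetical and not part of the answer:

```python
import os
import re

# Date, optional "_patients" suffix (captured as group 1), then ".dat" and nothing more.
DAT_PATTERN = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}(_patients)?\.dat$')

def split_by_type(filenames):
    """Return (plain_files, patient_files) for names matching DAT_PATTERN."""
    plain, patients = [], []
    for name in filenames:
        match = DAT_PATTERN.match(name)
        if match is None:
            continue  # not one of the two expected file types
        if match.group(1):
            patients.append(name)  # the "_patients" group took part in the match
        else:
            plain.append(name)
    return plain, patients

if __name__ == '__main__':
    plain, patients = split_by_type(os.listdir('.'))
    print(plain)
    print(patients)
```

Names that do not match the pattern at all are skipped rather than placed in either list.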

Upvotes: 0

JL Peyret

Reputation: 12154

touch 2018-01-02_patients.dat 2018-01-02.dat 1980-01-02.dat 1980-01-02_patients.dat

pgm:

import glob
li = glob.glob("????-*-*.dat")

patients = [fn for fn in li if "_patients." in fn]
dates = [fn for fn in li if "_patients." not in fn]

print ("patients", patients)
print ("dates", dates)

output:

('patients', ['1980-01-02_patients.dat', '2018-01-02_patients.dat'])
('dates', ['1980-01-02.dat', '2018-01-02.dat'])

Upvotes: 0

murphy1310

Reputation: 667

If those are the only two types of files in the directory, build two lists and remove the patient files from the larger one; that leaves you with the two required lists.
Something like this:

import glob

list1 = glob.glob('*.dat')           # every .dat file, patients included
list2 = glob.glob('*_patients.dat')  # patient files only

result_list_2 = list2
result_list_1 = [x for x in list1 if x not in list2]
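The same idea can be wrapped in a function that takes a directory. This is only a sketch under the same assumption (no other .dat files are present); the name split_dat_files is hypothetical, and a set is used so the membership test does not rescan the patient list for every file:

```python
import glob
import os

def split_dat_files(directory):
    """Return (plain_files, patient_files) as sorted lists of paths."""
    patients = sorted(glob.glob(os.path.join(directory, '*_patients.dat')))
    patient_set = set(patients)
    # '*.dat' also matches the patient files, so drop those from the full listing.
    plain = sorted(
        path for path in glob.glob(os.path.join(directory, '*.dat'))
        if path not in patient_set
    )
    return plain, patients
```

The lists are sorted because glob does not promise any particular order.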

Upvotes: 0

Ajax1234

Reputation: 71451

You can use re with glob:

import glob
import re
final_files = [i for i in glob.glob('*') if re.findall(r'\.dat$|_patients\.dat$', i)]

Upvotes: 1

heemayl

Reputation: 41987

To precisely match the digits, you can use the glob patterns:

[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat  # matches e.g. 2018-01-02.dat
[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_patients.dat  # matches e.g. 2018-01-02_patients.dat

You can also use ? instead of [0-9] to match any single character, if you are sure no other similarly named files are present.

In [103]: glob.glob('[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat')
Out[103]: ['2018-01-02.dat', '2014-03-12.dat']

In [104]: glob.glob('[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_patients.dat')
Out[104]: ['2018-01-02_patients.dat', '2014-03-12_patients.dat']
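To see how the looser ? variant mentioned above differs from [0-9], here is a small sketch. It uses fnmatch.filter, which applies the same shell-style wildcards as glob but to a list of names, so no files need to exist; the names in the list are made up for illustration:

```python
import fnmatch

# Hypothetical directory listing, used only for illustration.
names = ['2018-01-02.dat', '2018-01-02_patients.dat', 'abcd-ef-gh.dat', 'notes.txt']

# '?' matches any single character, so a non-date name of the same shape gets through.
loose_plain = fnmatch.filter(names, '????-??-??.dat')
loose_patients = fnmatch.filter(names, '????-??-??_patients.dat')

# '[0-9]' restricts every position to a digit.
strict_plain = fnmatch.filter(names, '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat')

print(loose_plain)
print(loose_patients)
print(strict_plain)
```

In both variants the plain pattern never picks up the "_patients" files, because nothing in it can match the extra suffix.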

Upvotes: 3
