Reputation: 499
I want to get two separate lists of file names using glob, with each list containing only one type of file. I have two types of data files. For example, 2018-01-02.dat and 2018-01-02_patients.dat.
The only difference is that the date in the second file type is followed by "_patients". The date can be anything, but its format is consistent. How can I accomplish this using glob?
Upvotes: 3
Views: 93
Reputation: 5603
glob isn't particularly well suited to this task, but regular expressions are. You can use os.listdir(path) to get a list of all files and use re.match to ensure the presence of a date, maybe followed by "_patients", definitely followed by ".dat". Here's how I'd do that:
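import os
import re

# A date (YYYY-MM-DD), optionally followed by "_patients", then ".dat"
pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2}(_patients)?\.dat$'

def is_patient_file(filename):
    return re.match(pattern, filename) is not None

def get_patient_files(path):
    all_files = os.listdir(path)
    # list() so the names print under Python 3, where filter() is lazy
    return list(filter(is_patient_file, all_files))

print(get_patient_files('.'))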
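The parts of the regular expression are:
[0-9]{4}-[0-9]{2}-[0-9]{2}: a date, i.e. four digits, two digits and two digits separated by hyphens
(_patients)?: the text "_patients", optionally
\.dat: a literal ".dat" (the backslash stops the dot from matching any character)
$: the end of the file name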
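Since the question asks for two separate lists, the optional group can also be used to tell the two file types apart. A minimal sketch, assuming a hypothetical helper named split_by_type (not part of the code above) that takes a list of names such as the result of os.listdir(path):

```python
import re

# Same pattern as above, compiled once. The only capture group is the
# optional "_patients" suffix.
DATE_FILE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}(_patients)?\.dat$')

def split_by_type(filenames):
    """Hypothetical helper: return (plain_files, patient_files)."""
    plain, patients = [], []
    for name in filenames:
        match = DATE_FILE.match(name)
        if match is None:
            continue  # not a date-named .dat file, skip it
        # group(1) is "_patients" when the suffix is present, otherwise None
        if match.group(1):
            patients.append(name)
        else:
            plain.append(name)
    return plain, patients

names = ['2018-01-02.dat', '2018-01-02_patients.dat', 'notes.txt', '1980-01-02.dat']
plain, patients = split_by_type(names)
print(plain)     # ['2018-01-02.dat', '1980-01-02.dat']
print(patients)  # ['2018-01-02_patients.dat']
```

Names that do not start with a date, such as notes.txt, end up in neither list.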
Upvotes: 0
Reputation: 12154
touch 2018-01-02_patients.dat 2018-01-02.dat 1980-01-02.dat 1980-01-02_patients.dat
Program:
import glob
li = glob.glob("????-*-*.dat")
patients = [fn for fn in li if "_patients." in fn]
dates = [fn for fn in li if "_patients." not in fn]
print ("patients", patients)
print ("dates", dates)
output:
('patients', ['1980-01-02_patients.dat', '2018-01-02_patients.dat'])
('dates', ['1980-01-02.dat', '2018-01-02.dat'])
Upvotes: 0
Reputation: 667
If those are the only two types of files in the directory, build two lists and remove the "_patients" entries from the bigger one; that leaves you with the two required lists.
Something like this:
import glob
list1 = glob.glob('*.dat')  # every .dat file, both types
list2 = glob.glob('*_patients.dat')  # only the "_patients" files
result_list_2 = list2
result_list_1 = [x for x in list1 if x not in list2]
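For directories with many files, the same idea can be written with a set so that each membership check is cheap. This is only a sketch: the function name split_dat_files and the temporary directory used to try it out are assumptions made for illustration.

```python
import glob
import os
import tempfile

def split_dat_files(directory):
    """Hypothetical helper: return (plain_files, patient_files) as sorted paths."""
    all_dat = glob.glob(os.path.join(directory, '*.dat'))
    patients = glob.glob(os.path.join(directory, '*_patients.dat'))
    patient_set = set(patients)  # set lookup instead of scanning a list
    plain = [path for path in all_dat if path not in patient_set]
    # glob returns names in arbitrary order, so sort for stable results
    return sorted(plain), sorted(patients)

# Try it out on a throwaway directory holding one file of each type.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('2018-01-02.dat', '2018-01-02_patients.dat'):
        open(os.path.join(tmp, name), 'w').close()
    plain, patients = split_dat_files(tmp)
    print([os.path.basename(p) for p in plain])     # ['2018-01-02.dat']
    print([os.path.basename(p) for p in patients])  # ['2018-01-02_patients.dat']
```

As with the lists above, this still assumes that every .dat file in the directory is one of the two types.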
Upvotes: 0
Reputation: 71451
You can use re with glob:
import glob
import re
final_files = [i for i in glob.glob('*') if re.findall(r'\.dat$|_patients\.dat$', i)]
Upvotes: 1
Reputation: 41987
To precisely match the digits, you can use the glob patterns:
[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat # matches e.g. 2018-01-02.dat
[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_patients.dat # matches e.g. 2018-01-02_patients.dat
You can also use ? instead of [0-9] to match any single character, if you are sure there are no other similarly named files.
In [103]: glob.glob('[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat')
Out[103]: ['2018-01-02.dat', '2014-03-12.dat']
In [104]: glob.glob('[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_patients.dat')
Out[104]: ['2018-01-02_patients.dat', '2014-03-12_patients.dat']
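The difference between [0-9] and ? can be checked without touching the file system by using fnmatch, the standard-library module that implements the same wildcard matching glob relies on. A small sketch; the sample names, including abcd-ef-gh.dat, are made up for illustration:

```python
import fnmatch

# Eight digits in a YYYY-MM-DD layout, written as glob character classes.
DIGITS = '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'

names = ['2018-01-02.dat', '2018-01-02_patients.dat', 'abcd-ef-gh.dat']

strict_plain = fnmatch.filter(names, DIGITS + '.dat')
strict_patients = fnmatch.filter(names, DIGITS + '_patients.dat')
# "?" accepts any single character, so non-digit names slip through.
loose_plain = fnmatch.filter(names, '????-??-??.dat')

print(strict_plain)     # ['2018-01-02.dat']
print(strict_patients)  # ['2018-01-02_patients.dat']
print(loose_plain)      # ['2018-01-02.dat', 'abcd-ef-gh.dat']
```

Neither pattern for the plain files picks up the "_patients" files, because nothing in the pattern can stand in for the extra suffix.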
Upvotes: 3