Reputation: 2269
I have this list of log files that I want to sort by the date in each file name. As you can see, the number right after LOG_ is the key I want to sort on.
The date is in yyyymmdd format.
LOGS\LOG_20190218_91_02.LOG
LOGS\LOG_20190218_91_05.LOG
LOGS\LOG_20190218_91_00.LOG
LOGS\LOG_20190218_91_22.LOG
LOGS\LOG_20190218_91_10.LOG
LOGS\LOG_20190219_56_22.LOG
LOGS\LOG_20190219_56_24.LOG
LOGS\LOG_20190219_56_25.LOG
LOGS\LOG_20190219_56_26.LOG
LOGS\LOG_20190219_56_03.LOG
LOGS\LOG_20190220_56_22.LOG
LOGS\LOG_20190220_56_07.LOG
LOGS\LOG_20190220_56_13.LOG
LOGS\LOG_20190220_56_17.LOG
LOGS\LOG_20190220_56_21.LOG
I tried various approaches:
Extract the date values, add them to a list, deduplicate them (using set) and, for each date, collect the matching strings/file paths into a list. The problem is that the number of distinct dates can vary (here there are only 3, but there could be more), so using a fixed set of lists is (maybe) out of scope.
Check each string against the previous/next one to see whether the date changed. If it changed, add all the previous paths/strings to a list. This has the same problem, but maybe this approach could be improved.
Manually copy the files into one folder per date and then work with them. This is out of scope for now because we are talking about huge files (gigabytes).
What I would like to understand is how the second solution could be implemented. How can I properly store the files/strings with the same date in their own list?
Expected result...
list20190218 = [all LOG files with 20190218 value in name]
list20190219 = [all LOG files with 20190219 value in name]
list20190220 = [all LOG files with 20190220 value in name]
...but with a variable number of lists.
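The second approach (watching for the date to change between consecutive paths) could be sketched roughly as below. This is only a minimal sketch: `date_of` and `group_by_date_change` are hypothetical helper names, and it assumes every file name follows the `LOG_yyyymmdd_..` pattern shown above. A dictionary keyed by date stands in for the variable number of lists.

```python
import ntpath


def date_of(path):
    # Hypothetical helper: 'LOGS\LOG_20190218_91_02.LOG' -> '20190218'.
    # ntpath splits on backslashes regardless of the OS running the script.
    return ntpath.basename(path).split('_')[1]


def group_by_date_change(paths):
    groups = {}            # date string -> list of paths with that date
    previous_date = None
    # Sorting first guarantees that equal dates sit next to each other.
    for path in sorted(paths, key=date_of):
        date = date_of(path)
        if date != previous_date:
            groups[date] = []      # the date changed: open a new list
            previous_date = date
        groups[date].append(path)
    return groups


sample = [r'LOGS\LOG_20190219_56_22.LOG',
          r'LOGS\LOG_20190218_91_02.LOG',
          r'LOGS\LOG_20190218_91_05.LOG']
by_date = group_by_date_change(sample)
print(by_date['20190218'])
```

Each date gets its own list without the number of dates being known in advance.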
Thanks
Upvotes: 1
Views: 82
Reputation: 2269
I'll also post my solution. It's more verbose, but maybe a little easier to understand than a list comprehension.
import os
# Raw strings keep the backslashes literal and avoid invalid-escape warnings.
LOGS = [r'LOGS\LOG_20190218_91_02.LOG',
        r'LOGS\LOG_20190218_91_05.LOG',
        r'LOGS\LOG_20190218_91_00.LOG',
        r'LOGS\LOG_20190218_91_22.LOG',
        r'LOGS\LOG_20190218_91_10.LOG',
        r'LOGS\LOG_20190219_56_22.LOG',
        r'LOGS\LOG_20190219_56_24.LOG',
        r'LOGS\LOG_20190219_56_25.LOG',
        r'LOGS\LOG_20190219_56_26.LOG',
        r'LOGS\LOG_20190219_56_03.LOG',
        r'LOGS\LOG_20190220_56_22.LOG',
        r'LOGS\LOG_20190220_56_07.LOG',
        r'LOGS\LOG_20190220_56_13.LOG',
        r'LOGS\LOG_20190220_56_17.LOG',
        r'LOGS\LOG_20190220_56_21.LOG']
# Collect the date part of every file name.
dateList = []
for log in LOGS:
    baseName = os.path.basename(log)
    date = baseName.split('_')[1][:8]
    dateList.append(date)
# Keep each date only once.
dateList = set(dateList)
# Map each date to the list of paths that contain it.
myDict = {}
for date in dateList:
    for log in LOGS:
        if date in log:
            myDict.setdefault(date, [])
            myDict[date].append(log)
for key, value in myDict.items():
    print(key, value)
Output:
20190220 ['LOGS\\LOG_20190220_56_22.LOG', 'LOGS\\LOG_20190220_56_07.LOG', 'LOGS\\LOG_20190220_56_13.LOG', 'LOGS\\LOG_20190220_56_17.LOG', 'LOGS\\LOG_20190220_56_21.LOG']
20190219 ['LOGS\\LOG_20190219_56_22.LOG', 'LOGS\\LOG_20190219_56_24.LOG', 'LOGS\\LOG_20190219_56_25.LOG', 'LOGS\\LOG_20190219_56_26.LOG', 'LOGS\\LOG_20190219_56_03.LOG']
20190218 ['LOGS\\LOG_20190218_91_02.LOG', 'LOGS\\LOG_20190218_91_05.LOG', 'LOGS\\LOG_20190218_91_00.LOG', 'LOGS\\LOG_20190218_91_22.LOG', 'LOGS\\LOG_20190218_91_10.LOG']
If you use print(myDict["20190220"]) you get:
['LOGS\\LOG_20190220_56_22.LOG', 'LOGS\\LOG_20190220_56_07.LOG', 'LOGS\\LOG_20190220_56_13.LOG', 'LOGS\\LOG_20190220_56_17.LOG', 'LOGS\\LOG_20190220_56_21.LOG']
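The nested loop above scans every path once per distinct date. If that becomes slow for long lists, a single pass with `collections.defaultdict` may be enough. This is a sketch rather than a drop-in replacement: the short `logs` list and the name `logs_by_date` are made up for illustration, and it assumes the date is always the second underscore-separated field of the file name.

```python
import ntpath
from collections import defaultdict

# Shortened, illustrative input; the real list would be the LOGS list above.
logs = [r'LOGS\LOG_20190218_91_02.LOG',
        r'LOGS\LOG_20190220_56_22.LOG',
        r'LOGS\LOG_20190218_91_05.LOG']

logs_by_date = defaultdict(list)   # missing dates start as an empty list
for log in logs:
    # ntpath.basename handles the backslash separator on any OS.
    date = ntpath.basename(log).split('_')[1]
    logs_by_date[date].append(log)

for date, paths in sorted(logs_by_date.items()):
    print(date, paths)
```

Each path is visited exactly once, and paths keep their original relative order inside each date's list.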
Upvotes: 0
Reputation: 88236
A clean way to do this is with a dictionary: the keys are the dates and the values are the corresponding lists. To group the elements of the list you can use itertools.groupby. You also need to specify that you want to group by the date, so extract the date substring from each string in the key argument. Note that groupby only groups consecutive items, so the list (data below) must already be ordered by date, as it is here; otherwise sort it with the same key first:
from itertools import groupby
from operator import itemgetter
d = {k:list(v) for k,v in groupby(data, key=lambda x: itemgetter(1)(x.split('_')))}
Then simply do:
d['20190220']
['LOGS\\LOG_20190220_56_22.LOG\n',
'LOGS\\LOG_20190220_56_07.LOG\n',
'LOGS\\LOG_20190220_56_13.LOG\n',
'LOGS\\LOG_20190220_56_17.LOG\n',
'LOGS\\LOG_20190220_56_21.LOG']
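If the input is not already ordered by date, or the lines still carry a trailing newline as in the output above, a sort and a strip before grouping may help. The following is a minimal sketch under those assumptions; `lines`, `date_key` and `grouped` are illustrative names, not part of the code above.

```python
from itertools import groupby


def date_key(path):
    # 'LOGS\LOG_20190220_56_22.LOG' -> '20190220'
    return path.split('_')[1]


# Illustrative unsorted input, as it might come from reading a text file.
lines = ['LOGS\\LOG_20190220_56_22.LOG\n',
         'LOGS\\LOG_20190218_91_02.LOG\n',
         'LOGS\\LOG_20190220_56_07.LOG']

# groupby only merges adjacent items, so sort by the same key first.
paths = sorted((line.strip() for line in lines), key=date_key)
grouped = {date: list(group) for date, group in groupby(paths, key=date_key)}

print(grouped['20190220'])
```

Without the sort, a date that appears in two separate runs would produce two groups, and the dict comprehension would keep only the last one.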
Upvotes: 1
Reputation: 23815
Code below.
Create a named tuple that holds each file name together with its date, sort the list using the date as the key, then collect the names per date in a defaultdict.
from collections import namedtuple, defaultdict
import datetime
FileAttr = namedtuple('FileAttr', 'name date')
# Raw strings keep the backslashes literal and avoid invalid-escape warnings.
files = [r'LOGS\LOG_20190218_91_02.LOG',
         r'LOGS\LOG_20190218_91_05.LOG',
         r'LOGS\LOG_20190218_91_00.LOG',
         r'LOGS\LOG_20190218_91_22.LOG',
         r'LOGS\LOG_20190218_91_10.LOG',
         r'LOGS\LOG_20190219_56_22.LOG',
         r'LOGS\LOG_20190219_56_24.LOG',
         r'LOGS\LOG_20190219_56_25.LOG',
         r'LOGS\LOG_20190219_56_26.LOG',
         r'LOGS\LOG_20180219_56_26.LOG',
         r'LOGS\LOG_20170219_56_26.LOG',
         r'LOGS\LOG_20190219_56_03.LOG',
         r'LOGS\LOG_20190220_56_22.LOG',
         r'LOGS\LOG_20190220_56_07.LOG',
         r'LOGS\LOG_20190220_56_13.LOG',
         r'LOGS\LOG_20190220_56_17.LOG',
         r'LOGS\LOG_20190220_56_21.LOG']
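files_ex = []
for f in files:
    # Take the text between the first '_' and the '.', then drop the
    # trailing '_NN_NN' (6 characters) to leave 'yyyymmdd'.
    left_idx = f.find('_')
    right_idx = f.find('.')
    date_part = f[left_idx + 1:right_idx][:-6]
    year = int(date_part[:4])
    month = int(date_part[4:6])
    day = int(date_part[6:8])
    dt = datetime.datetime(year, month, day)
    files_ex.append(FileAttr(f, dt))
# Sort by the parsed date (field 1 of the named tuple).
sorted_files_ex = sorted(files_ex, key=lambda x: x[1])
files_by_date = defaultdict(list)
for file_attr in sorted_files_ex:
    files_by_date[file_attr.date].append(file_attr.name)
# On Python 3.7+ dicts keep insertion order, so the dates print in
# ascending order; older versions may print them in a different order.
for date, names in files_by_date.items():
    print('{} --> {}'.format(date, names))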
Output:
2019-02-18 00:00:00 --> ['LOGS\\LOG_20190218_91_02.LOG', 'LOGS\\LOG_20190218_91_05.LOG', 'LOGS\\LOG_20190218_91_00.LOG', 'LOGS\\LOG_20190218_91_22.LOG', 'LOGS\\LOG_20190218_91_10.LOG']
2019-02-19 00:00:00 --> ['LOGS\\LOG_20190219_56_22.LOG', 'LOGS\\LOG_20190219_56_24.LOG', 'LOGS\\LOG_20190219_56_25.LOG', 'LOGS\\LOG_20190219_56_26.LOG', 'LOGS\\LOG_20190219_56_03.LOG']
2017-02-19 00:00:00 --> ['LOGS\\LOG_20170219_56_26.LOG']
2018-02-19 00:00:00 --> ['LOGS\\LOG_20180219_56_26.LOG']
2019-02-20 00:00:00 --> ['LOGS\\LOG_20190220_56_22.LOG', 'LOGS\\LOG_20190220_56_07.LOG', 'LOGS\\LOG_20190220_56_13.LOG', 'LOGS\\LOG_20190220_56_17.LOG', 'LOGS\\LOG_20190220_56_21.LOG']
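The manual slicing of year, month and day could also be handed to `datetime.strptime`, with a regular expression locating the date. The sketch below is one possible variant, not the code above: `DATE_RE` and `parse_date` are hypothetical names, the input list is shortened for illustration, and it assumes every path contains `LOG_` followed by eight digits and an underscore.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed pattern: 'LOG_' + 8 digits (yyyymmdd) + '_'.
DATE_RE = re.compile(r'LOG_(\d{8})_')


def parse_date(path):
    match = DATE_RE.search(path)
    if match is None:
        raise ValueError('no yyyymmdd date found in {!r}'.format(path))
    # strptime also rejects impossible dates such as 20190231.
    return datetime.strptime(match.group(1), '%Y%m%d').date()


files = [r'LOGS\LOG_20190218_91_02.LOG',
         r'LOGS\LOG_20170219_56_26.LOG',
         r'LOGS\LOG_20190218_91_05.LOG']

files_by_date = defaultdict(list)
for path in sorted(files, key=parse_date):
    files_by_date[parse_date(path)].append(path)

for day, names in files_by_date.items():
    print('{} --> {}'.format(day, names))
```

Using `date` objects instead of `datetime` drops the `00:00:00` time part from the printed keys.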
Upvotes: 2