Reputation: 13062
I have read the names of all the files in a directory into a Python list like this:
files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
What I want to do is group similar files as tuples within the list. For the example above, the result should look like this:
files_grouped = ['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
One way I have tried is to separate the elements I need to group from the list like so
groups = tuple([file for file in files if '_' in file])
single = [file for file in files if not '_' in file]
Then I would create a new list by appending both. But how do I create the groups as a list of tuples for ch3 and ch4, like [('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')], instead of one big tuple?
Upvotes: 2
Views: 79
Reputation: 14987
You could use a dictionary (or, for simpler initialisation, a collections.defaultdict):
from collections import defaultdict
from pprint import pprint
files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
grouped = defaultdict(list)  # missing keys start out as an empty list
for f in files:
    key = f[:3]  # first three characters, e.g. 'ch3' (assumes single-digit chapter numbers)
    grouped[key].append(f)
pprint(grouped)
Result:
defaultdict(<class 'list'>,
{'ch1': ['ch1.txt'],
'ch2': ['ch2.txt'],
'ch3': ['ch3_1.txt', 'ch3_2.txt'],
'ch4': ['ch4_2.txt', 'ch4_1.txt']})
If you want your list of tuples, you can do:
grouped = [tuple(l) for l in grouped.values()]
This gives:
[('ch1.txt',),
('ch2.txt',),
('ch3_1.txt', 'ch3_2.txt'),
('ch4_2.txt', 'ch4_1.txt')]
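A slightly more defensive variant of the same idea, as a sketch only. It assumes the group is whatever precedes the first underscore in the file name (extension removed), so ch10_1.txt is keyed as ch10 rather than ch1, and it unwraps single-file groups to match the format asked for in the question. The helper name group_files is made up for illustration, and the output order relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict
import os


def group_files(files):
    """Group names that share the stem before the first underscore (assumed convention)."""
    grouped = defaultdict(list)
    for name in files:
        stem, _ext = os.path.splitext(name)  # 'ch3_1.txt' -> 'ch3_1'
        key = stem.split('_', 1)[0]          # 'ch3_1' -> 'ch3'
        grouped[key].append(name)
    # Single files stay as plain strings; larger groups become sorted tuples.
    return [g[0] if len(g) == 1 else tuple(sorted(g)) for g in grouped.values()]


files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
print(group_files(files))
```

Groups appear in the order their first member is seen in the input, so sort the input first if a particular chapter order matters.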
Upvotes: 2
Reputation: 402333
None of the other answers gives you a generic solution that works for any kind of file name. If you want to account for that, I think you should use a regex.
import itertools
import re
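files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
# sort by (chapter, subsection); names without a subsection sort first
sorted_files = sorted(files, key=lambda x: re.findall(r'(\d+)_(\d+)', x))
# group consecutive names that share the same first run of digits
out = [list(g) for _, g in itertools.groupby(sorted_files,
                                             key=lambda x: re.search(r'\d+', x).group())]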
print(out)
[['ch1.txt'],
['ch2.txt'],
['ch3_1.txt', 'ch3_2.txt'],
['ch4_1.txt', 'ch4_2.txt']]
Note that this should work for any naming format in which the first run of digits identifies the group, not just chX_X.
If you want your output in the exact format described, you could do a little extra post-processing:
out = [o[0] if len(o) == 1 else tuple(o) for o in out]
print(out)
['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
Regex Details
The first regex sorts by chapter section and subsection.
(        # first group
  \d+    # 1 or more digits
)
_        # literal underscore
(        # second group
  \d+    # 1 or more digits
)
The second regex, \d+, matches only the first run of digits (the chapter number), so all files belonging to the same chapter are grouped together.
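One caveat worth illustrating: the regex matches are strings, so they compare character by character ('10' sorts before '2'), and names without an underscore all share the same empty sort key. A possible variant, sketched here under the assumption that every name contains at least one digit and that digits appear only in the chapter and subsection numbers, converts the digit runs to integers. The helper names numeric_key and chapter are invented for this example.

```python
import itertools
import re


def numeric_key(name):
    """All digit runs in the name as ints, e.g. 'ch10_2.txt' -> [10, 2]."""
    return [int(n) for n in re.findall(r'\d+', name)]


def chapter(name):
    """First digit run as an int; raises IndexError if the name has no digits."""
    return numeric_key(name)[0]


files = ['ch10_1.txt', 'ch2.txt', 'ch3_2.txt', 'ch10_2.txt', 'ch3_1.txt', 'ch3.txt']
sorted_files = sorted(files, key=numeric_key)
out = [list(g) for _, g in itertools.groupby(sorted_files, key=chapter)]
print(out)
# [['ch2.txt'], ['ch3.txt', 'ch3_1.txt', 'ch3_2.txt'], ['ch10_1.txt', 'ch10_2.txt']]
```

Because the sort key starts with the same number that groupby uses, files of one chapter always end up next to each other, and chapter 10 sorts after chapter 3.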
Upvotes: 3
Reputation: 11477
You could sort the list of file names and then use groupby(), e.g.:
from itertools import groupby
files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
print([tuple(g) for k,g in groupby(sorted(files),key=lambda x : x[:-4].split("_")[0])])
Result:
[('ch1.txt',), ('ch2.txt',), ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
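The sort matters because groupby() only merges consecutive items with equal keys. A small sketch comparing both cases on the same list (the helper name chapter_key is made up here, and it assumes every name ends in a four-character extension such as .txt):

```python
from itertools import groupby


def chapter_key(name):
    # 'ch3_1.txt' -> 'ch3': drop the 4-character '.txt' suffix, then cut at '_'
    return name[:-4].split("_")[0]


files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']

# Without sorting, ch3 and ch4 files are interleaved, so nothing gets merged.
unsorted_groups = [tuple(g) for _, g in groupby(files, key=chapter_key)]
# After sorting, files of the same chapter are adjacent and form one group.
sorted_groups = [tuple(g) for _, g in groupby(sorted(files), key=chapter_key)]

print(len(unsorted_groups))  # 6
print(sorted_groups)
```

For names with multi-digit chapters, sorting with the same key, sorted(files, key=chapter_key), is the safer choice, since it guarantees that equal keys end up adjacent.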
Hope this helps.
Upvotes: 1