Reputation: 143
I'm trying to find an efficient way to loop through elements in a list, and to group common elements together into another list, grouplist.
Example
In[]: grouplist = []
In[]: filelist
Out[]:['C:\\West-California-North-10.xlsx',
'C:\\West-California-North-5.xlsx',
'C:\\West-California-East-1.xlsx',
'C:\\West-California-South-1.xlsx',
'C:\\South-California-North-5.xlsx',
'C:\\West-California-South-3.xlsx']
I want to group the paths that share a common pattern and differ only in the trailing integer. So in this case,
First iteration grouplist =
C:\\West-California-North-10.xlsx
C:\\West-California-North-5.xlsx
Second Iteration =
C:\\West-California-East-1.xlsx
Third Iteration =
C:\\West-California-South-1.xlsx
C:\\West-California-South-3.xlsx
Upvotes: 2
Views: 584
Reputation: 4504
Here is another way of using regex and itertools.groupby:
import re
from itertools import groupby
filelist = ['C:\\West-California-North-10.xlsx',
'C:\\West-California-North-5.xlsx',
'C:\\West-California-East-1.xlsx',
'C:\\West-California-South-1.xlsx',
'C:\\South-California-North-5.xlsx',
'C:\\West-California-South-3.xlsx']
# key: everything before the trailing "-<number>.xlsx"
keyfunc = lambda x: re.match(r'(.*)-\d+\.xlsx', x).group(1)
# groupby only merges adjacent items, so sort first; [::-1] reverses the group order
grouplist = [list(v) for k, v in groupby(sorted(filelist), key=keyfunc)][::-1]
for group in grouplist:
    print group, '\r\n'
The output:
['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx']
['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx']
['C:\\West-California-East-1.xlsx']
['C:\\South-California-North-5.xlsx']
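Note that plain sorted compares the paths as strings, which is why North-10 comes before North-5 in the output above. If numeric order inside each group matters, a minimal Python 3 sketch could sort on a (prefix, integer) pair instead. The helper name split_key is hypothetical, and the pattern assumes every name ends in "-<digits>.xlsx".

```python
import re
from itertools import groupby

filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

# Assumption: every name ends in "-<digits>.xlsx".
PATTERN = re.compile(r'(.*)-(\d+)\.xlsx$')


def split_key(path):
    """Hypothetical helper: return (prefix, trailing number) for a path."""
    prefix, number = PATTERN.match(path).groups()
    return prefix, int(number)


# Sort by prefix first, then numerically by the trailing integer.
ordered = sorted(filelist, key=split_key)

# Group on the prefix only, so paths differing in the number stay together.
grouplist = [list(group)
             for _, group in groupby(ordered, key=lambda p: split_key(p)[0])]

for group in grouplist:
    print(group)
```

With this key, North-5 is listed before North-10 within its group.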
Upvotes: 1
Reputation: 5302
What about using sorted and regex? You get more control over the ordering this way: just change the sorter function.
import re
d = ['C:\\West-California-North-10.xlsx',
'C:\\West-California-North-5.xlsx',
'C:\\West-California-East-1.xlsx',
'C:\\West-California-South-3.xlsx',
'C:\\West-California-South-1.xlsx',
'C:\\South-California-North-5.xlsx',
'C:\\West-California-South-3.xlsx']
def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2,num
dd = sorted(d,key=sorter)
for t in dd:
    print t
Output-
C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx
Example of customizing the sorter function: if you change it as below, i.e. stop sorting on the number,
def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2  # omitted num here
Then the output is as follows (sorted is stable, so paths with equal keys keep their original relative order):
C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
Finally, you can collect the sorted paths into groups and iterate over them, as below:
import re
from collections import defaultdict,OrderedDict
d = ['C:\\West-California-North-10.xlsx',
'C:\\West-California-North-5.xlsx',
'C:\\West-California-East-1.xlsx',
'C:\\West-California-South-3.xlsx',
'C:\\West-California-South-1.xlsx',
'C:\\South-California-North-5.xlsx',
'C:\\West-California-South-3.xlsx']
group_data = defaultdict(list)
def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2,num
dd = sorted(d,key=sorter)
for t in dd:
    key = re.findall(r'([^\d]+)\d',t)[0]  # everything before the first digit
    group_data[key].append(t)
dt = OrderedDict(sorted(group_data.items(),key=lambda x: x[0]))
for it in dt:
    print '\n'.join(dt[it])+'\n'
Output-
C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx
Upvotes: 1
Reputation: 180550
Use a defaultdict:
from collections import defaultdict
d = defaultdict(set)
# l is the list of file paths from the question
for fle in l:
    k, rest = fle.rsplit("-", 1)  # split off the trailing "<number>.xlsx"
    d[k].add("{}-{}".format(k, rest))
for k,v in d.items():
    print "\n".join(v)
    print
Output:
C:\West-California-East-1.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\South-California-North-5.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
If you want to preserve the order in which the keys are first seen, use an OrderedDict:
from collections import OrderedDict
d = OrderedDict()
for fle in l:
    k, rest = fle.rsplit("-", 1)
    d.setdefault(k,set()).add("{}-{}".format(k, rest))
for k,v in d.items():
    print "\n".join(v)
    print
Output:
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\South-California-North-5.xlsx
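On Python 3.7 and later a plain dict also keeps insertion order, so the same idea can be sketched without OrderedDict. This variant is an assumption about what the asker wants: it stores lists instead of sets, which keeps the paths inside each group in their original order but does not remove duplicates, and it builds the grouplist the question mentions.

```python
filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

groups = {}  # plain dict: insertion-ordered on Python 3.7+
for fle in filelist:
    # Everything before the last "-" identifies the group.
    k, _ = fle.rsplit("-", 1)
    groups.setdefault(k, []).append(fle)

# One inner list per group, in the order the groups were first seen.
grouplist = list(groups.values())

for group in grouplist:
    print("\n".join(group))
    print()
```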
If the only digits in the names are the trailing numbers, you can also use str.translate to strip them instead of splitting:
from collections import defaultdict
d = defaultdict(set)
for fle in l:
    # Python 2 form: translate(None, chars) deletes every character in chars
    d[fle.translate(None,"0123456789")].add(fle)
for k,v in d.items():
    print "\n".join(v)
    print
Output:
C:\West-California-East-1.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\South-California-North-5.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
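The two-argument translate(None, ...) call is specific to Python 2 byte strings. On Python 3, a rough equivalent is to build a deletion table with str.maketrans. The sketch below makes the same assumption as the code above, namely that the only digits in each path are the trailing number; the name strip_digits is just illustrative.

```python
from collections import defaultdict
from string import digits

filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

# The third argument of str.maketrans lists characters to delete.
strip_digits = str.maketrans('', '', digits)

d = defaultdict(set)
for fle in filelist:
    # The path with all digits removed serves as the group key.
    d[fle.translate(strip_digits)].add(fle)

# Sort keys and members only to make the printed order deterministic.
for key in sorted(d):
    print("\n".join(sorted(d[key])))
    print()
```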
Upvotes: 1
Reputation: 107357
You can use a dictionary to categorize your paths by location name.
To separate the location name from the trailing id, use str.rsplit(), then call dict.setdefault() with a set() as the default value so that each group keeps only unique names:
>>> lst=['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx','C:\\West-California-East-1.xlsx','C:\\West-California-South-1.xlsx','C:\\South-California-North-5.xlsx','C:\\West-California-South-3.xlsx']
>>> d = {}
>>> new = [path.rsplit('-',1) for path in lst]
>>> for i,j in new:
...     d.setdefault(i,set()).add(i+'-'+j)
...
>>> d.values()
[set(['C:\\West-California-East-1.xlsx']),
set(['C:\\West-California-North-10.xlsx','C:\\West-California-North-5.xlsx']),
set(['C:\\South-California-North-5.xlsx']),
set(['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx'])]
Upvotes: 1
Reputation: 63802
itertools.groupby is your friend:
from itertools import groupby
filelist = [
'C:\\West-California-North-10.xlsx',
'C:\\West-California-North-5.xlsx',
'C:\\West-California-East-1.xlsx',
'C:\\West-California-South-1.xlsx',
'C:\\South-California-North-5.xlsx',
'C:\\West-California-South-3.xlsx']
key_fn = lambda s: s.rsplit('-',1)[0]
# before grouping, list has to be sorted
filelist = sorted(filelist, key=key_fn)
# usually use the same key_fn for grouping as was used for sorting
for key, grouped_file_names in groupby(filelist, key=key_fn):
    # groupby returns an iterator of tuples
    # the first element of the tuple is the grouped key value
    # the second element is a generator of the items that matched that key
    # (YOU MUST CONSUME THIS GENERATOR BEFORE MOVING ON TO THE NEXT KEY)
    print '\n'.join(list(grouped_file_names))
    print
prints
C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
Upvotes: 2