jfran

Reputation: 143

Filter through Python list to find common elements

I'm trying to find an efficient way to loop through the elements of a list and group the common elements together into another list, grouplist.

Example

In[]: grouplist = []

In[]: filelist
Out[]:['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

I want to group the paths that share a common pattern and differ only in the trailing integer. So in this case,

First iteration grouplist =

 C:\\West-California-North-10.xlsx
 C:\\West-California-North-5.xlsx

Second Iteration =

 C:\\West-California-East-1.xlsx

Third Iteration =

 C:\\West-California-South-1.xlsx
 C:\\West-California-South-3.xlsx

Upvotes: 2

Views: 584

Answers (5)

Quinn

Reputation: 4504

Here is another way of using regex and itertools.groupby:

import re
from itertools import groupby

filelist =  ['C:\\West-California-North-10.xlsx', 
            'C:\\West-California-North-5.xlsx', 
            'C:\\West-California-East-1.xlsx', 
            'C:\\West-California-South-1.xlsx',
            'C:\\South-California-North-5.xlsx',
            'C:\\West-California-South-3.xlsx']

# Group key: everything before the trailing "-<digits>.xlsx"
keyfunc = lambda x: re.match(r'(.*)-\d+\.xlsx', x).group(1)
# groupby only merges adjacent items, so sort by the same key first
grouplist = [list(v) for k, v in groupby(sorted(filelist, key=keyfunc), key=keyfunc)][::-1]
for group in grouplist:
    print group, '\r\n'

The output:

['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx'] 

['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx'] 

['C:\\West-California-East-1.xlsx'] 

['C:\\South-California-North-5.xlsx'] 
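
One caveat that may be worth illustrating: groupby only merges adjacent items that share a key, so the input has to be ordered by that key first. Below is a minimal Python 3 sketch, assuming a deliberately shuffled subset of the question's list; the helper name group_key is hypothetical and not part of the answer above.

```python
import re
from itertools import groupby

# Subset of the question's paths, reordered so equal keys are NOT adjacent.
filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

def group_key(path):
    # Hypothetical helper: everything before the trailing "-<digits>.xlsx".
    return re.match(r'(.*)-\d+\.xlsx', path).group(1)

# Without sorting, every run of equal keys becomes its own group.
unsorted_groups = [list(g) for _, g in groupby(filelist, key=group_key)]

# Sorting by the same key first brings matching paths together.
ordered = sorted(filelist, key=group_key)
sorted_groups = [list(g) for _, g in groupby(ordered, key=group_key)]

print(len(unsorted_groups))  # 4
print(len(sorted_groups))    # 2
```

Sorting and grouping with the same key function avoids relying on plain lexicographic order happening to line the groups up.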

Upvotes: 1

Learner

Reputation: 5302

What about using sorted and a regex? You get more control over the ordering this way: just change the sorter function.

import re

d = ['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-3.xlsx',
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction (West/South)
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction (North/East/South)
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # trailing number: 10, 5, 1 or 3
    return direction1,direction2,num
dd =  sorted(d,key=sorter)

for t in dd:
    print t

Output-

C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx
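
The int() conversion in the sort key is what puts 5 before 10 here. A small Python 3 sketch of the difference, assuming just two of the paths; the helper trailing_number is a hypothetical name used only for illustration:

```python
import re

names = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
]

def trailing_number(path):
    # Hypothetical helper: the integer just before ".xlsx".
    return int(re.search(r'-(\d+)\.xlsx$', path).group(1))

# Plain string comparison looks at characters: '1' < '5', so "-10" sorts first.
by_text = sorted(names)

# Comparing the numbers themselves: 5 < 10, so "-5" sorts first.
by_number = sorted(names, key=trailing_number)

print(by_text)
print(by_number)
```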

Example of customizing the sorter function:

If you change the sorter function as below, i.e. drop the number from the sort key:

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction (West/South)
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction (North/East/South)
    return direction1,direction2  # num is left out of the sort key

Then output-

C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

Finally, you can group the sorted entries and iterate over the groups as below:

import re
from collections import defaultdict,OrderedDict

d = ['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-3.xlsx',
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

group_data = defaultdict(list)

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction (West/South)
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction (North/East/South)
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # trailing number: 10, 5, 1 or 3
    return direction1,direction2,num
dd =  sorted(d,key=sorter)

for t in dd:
    key = re.findall(r'([^\d]+)\d',t)[0]
    group_data[key].append(t)

dt = OrderedDict(sorted(group_data.items(),key=lambda x: x[0]))
for it in dt:
    print '\n'.join(dt[it])+'\n'

Output-

C:\South-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx

Upvotes: 1

Padraic Cunningham

Reputation: 180550

Use a defaultdict (l below is your list of file names):

from collections import defaultdict
d = defaultdict(set)

for fle in l:
    k, rest = fle.rsplit("-", 1)
    d[k].add("{}-{}".format(k, rest))

for k,v in d.items():
    print "\n".join(v)
    print

Output:

C:\West-California-East-1.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\South-California-North-5.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

If you want to preserve the order in which the groups are first seen, use an OrderedDict:

from collections import OrderedDict
d = OrderedDict()

for fle in l:
    k, rest = fle.rsplit("-", 1)
    d.setdefault(k,set()).add("{}-{}".format(k, rest))

for k,v in d.items():
    print "\n".join(v)
    print

Output:

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

C:\South-California-North-5.xlsx
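
For readers on Python 3.7 or later, a plain dict also keeps insertion order, so a similar result should be possible without OrderedDict. This is only a sketch under that assumption; it swaps the set for a list so that the order inside each group is kept too, which means duplicates are not removed. The names filelist and groups are illustrative.

```python
filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

groups = {}  # plain dicts preserve insertion order on Python 3.7+
for path in filelist:
    # Split off the trailing "<number>.xlsx" part; the rest is the group key.
    prefix, _ = path.rsplit('-', 1)
    groups.setdefault(prefix, []).append(path)

for prefix, paths in groups.items():
    print('\n'.join(paths))
    print()
```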

If the names contain no digits other than the trailing number, you can also use str.translate instead of splitting:

from collections import defaultdict
d = defaultdict(set)

for fle in l:
    d[fle.translate(None,"0123456789")].add(fle)

for k,v in d.items():
    print "\n".join(v)
    print

Output:

C:\West-California-East-1.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

C:\South-California-North-5.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
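
The translate(None, ...) call above is the Python 2 form. On Python 3 the same idea can be sketched with str.maketrans, whose third argument lists characters to delete. The variable names below are illustrative, and the snippet assumes the question's list:

```python
from collections import defaultdict
from string import digits

filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

# Translation table that deletes every ASCII digit.
strip_digits = str.maketrans('', '', digits)

groups = defaultdict(set)
for path in filelist:
    # Paths that differ only in their digits collapse to the same key.
    groups[path.translate(strip_digits)].add(path)

for key, paths in groups.items():
    print('\n'.join(sorted(paths)))
    print()
```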

Upvotes: 1

Kasravnd

Reputation: 107357

You can use a dictionary in order to categorize your paths based on the location name.

To separate the location name from the trailing id, you can use str.rsplit(), then call dict.setdefault() with a set() as the default so that each group keeps only unique names:

>>> lst=['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx','C:\\West-California-East-1.xlsx','C:\\West-California-South-1.xlsx','C:\\South-California-North-5.xlsx','C:\\West-California-South-3.xlsx']

>>> d = {}
>>> new = [path.rsplit('-',1) for path in lst]

>>> for i,j in new:
...     d.setdefault(i,set()).add(i+'-'+j)
... 

>>> d.values()
[set(['C:\\West-California-East-1.xlsx']),
 set(['C:\\West-California-North-10.xlsx','C:\\West-California-North-5.xlsx']), 
 set(['C:\\South-California-North-5.xlsx']),
 set(['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx'])]
>>> 

Upvotes: 1

PaulMcG

Reputation: 63802

itertools.groupby is your friend:

from itertools import groupby


filelist = [
    'C:\\West-California-North-10.xlsx', 
    'C:\\West-California-North-5.xlsx', 
    'C:\\West-California-East-1.xlsx', 
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx']

key_fn = lambda s: s.rsplit('-',1)[0]

# before grouping, list has to be sorted
filelist = sorted(filelist, key=key_fn)

# usually use the same key_fn for grouping as was used for sorting
for key, grouped_file_names in groupby(filelist, key=key_fn):
    # groupby returns an iterator of tuples
    # the first element of the tuple is the grouped key value
    # the second element is an iterator over the items that matched that key
    # (YOU MUST CONSUME THIS ITERATOR BEFORE MOVING ON TO THE NEXT KEY)
    print '\n'.join(list(grouped_file_names))
    print

prints

C:\South-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
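
To illustrate the warning in the comments about consuming each group before moving on, here is a small Python 3 sketch with made-up sample data (names is not from the question). It stores the group iterators without consuming them and then looks at the first one after groupby has advanced:

```python
from itertools import groupby

names = ['a-1', 'a-2', 'b-1']  # made-up sample data, already sorted by key
key_fn = lambda s: s.rsplit('-', 1)[0]

# Keep the group iterators around without consuming them.
lazy = [(key, group) for key, group in groupby(names, key=key_fn)]

# groupby has moved on by now, so the stored iterator for 'a' yields nothing.
first_key, first_group = lazy[0]
leftover = list(first_group)

# Materialize each group while it is still the current one.
eager = [(key, list(group)) for key, group in groupby(names, key=key_fn)]

print(first_key, leftover)
print(eager)
```

Calling list() on each group inside the loop, as the answer does, is the simplest way to avoid losing items.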

Upvotes: 2
