Ha An Tran

Reputation: 357

Split a list of urls with similar pattern into dicts

I have a list of URLs that follow a similar pattern, like this:

['../abc/def/xyz/0008c5398-1.jpg',
 '../abc/def/xyz//0008c5398-2.jpg',
 '../abc/def/xyz//0008c5398-3.jpg',
 '../abc/def/xyz//0008c5398-4.jpg',
 '../abc/def/xyz//0008c5398-5.jpg',
 '../abc/def/xyz//000a290e4-1.jpg',
 '../abc/def/xyz//000a290e4-2.jpg',
 '../abc/def/xyz//000fb9572-1.jpg',
 '../abc/def/xyz//000fb9572-2.jpg',
 '../abc/def/xyz//000fb9572-3.jpg',
 '../abc/def/xyz//000fb9572-4.jpg']

The first part, '../abc/def/xyz/', is common to all URLs. I want to group the links that share an ID into a dict, something like this:

{"0008c5398": ['../abc/def/xyz/0008c5398-1.jpg',
 '../abc/def/xyz//0008c5398-2.jpg',
 '../abc/def/xyz//0008c5398-3.jpg',
 '../abc/def/xyz//0008c5398-4.jpg',
 '../abc/def/xyz//0008c5398-5.jpg'],
"000a290e4": [ '../abc/def/xyz//000a290e4-1.jpg',
 '../abc/def/xyz//000a290e4-2.jpg'],
"000fb9572": [ '../abc/def/xyz//000fb9572-1.jpg',
 '../abc/def/xyz//000fb9572-2.jpg',
 '../abc/def/xyz//000fb9572-3.jpg',
 '../abc/def/xyz//000fb9572-4.jpg']
}

Any hints? Many thanks in advance...

Upvotes: 0

Views: 124

Answers (6)

Haran Rajkumar

Reputation: 2395

Here's a simple solution that iterates over the list and appends each URL to a dictionary.

import os
import pprint
a = ['../abc/def/xyz/0008c5398-1.jpg',
     '../abc/def/xyz//0008c5398-2.jpg',
     '../abc/def/xyz//0008c5398-3.jpg',
     '../abc/def/xyz//0008c5398-4.jpg',
     '../abc/def/xyz//0008c5398-5.jpg',
     '../abc/def/xyz//000a290e4-1.jpg',
     '../abc/def/xyz//000a290e4-2.jpg',
     '../abc/def/xyz//000fb9572-1.jpg',
     '../abc/def/xyz//000fb9572-2.jpg',
     '../abc/def/xyz//000fb9572-3.jpg',
     '../abc/def/xyz//000fb9572-4.jpg']
url_dict = {}
for url in a:
    # The ID is the file name up to the first '-'
    image_id = os.path.split(url)[-1].split('-')[0]
    if image_id not in url_dict:
        url_dict[image_id] = [url]
    else:
        url_dict[image_id].append(url)

pprint.pprint(url_dict)

Output:

{'0008c5398': ['../abc/def/xyz/0008c5398-1.jpg',
               '../abc/def/xyz//0008c5398-2.jpg',
               '../abc/def/xyz//0008c5398-3.jpg',
               '../abc/def/xyz//0008c5398-4.jpg',
               '../abc/def/xyz//0008c5398-5.jpg'],
 '000a290e4': ['../abc/def/xyz//000a290e4-1.jpg',
               '../abc/def/xyz//000a290e4-2.jpg'],
 '000fb9572': ['../abc/def/xyz//000fb9572-1.jpg',
               '../abc/def/xyz//000fb9572-2.jpg',
               '../abc/def/xyz//000fb9572-3.jpg',
               '../abc/def/xyz//000fb9572-4.jpg']}

Upvotes: 1

Frank AK

Reputation: 1781

First, you need a way to extract the key from each item. You can use re for that.

In [106]: import re

In [107]: pat = r'.*?xyz//(.*)-.*'

In [108]: match = re.search(pat, '../abc/def/xyz//0008c5398-4.jpg')

In [109]: match.group(1)
Out[109]: '0008c5398'

Then you need a loop that does the same for every item. To keep it simple, you can use defaultdict.

In [110]: from collections import defaultdict

In [111]: d = defaultdict(set)

In [119]: for i in sample:
     ...:     pat = r'.*?xyz//(.*)-.*'
     ...:     match = re.search(pat, i)
     ...:     if not match:
     ...:         continue
     ...:     key = match.group(1)
     ...:     d[key].add(i)
     ...:

In [120]: d
Out[120]: 
defaultdict(set,
            {'0008c5398': {'../abc/def/xyz//0008c5398-2.jpg',
              '../abc/def/xyz//0008c5398-3.jpg',
              '../abc/def/xyz//0008c5398-4.jpg',
              '../abc/def/xyz//0008c5398-5.jpg'},
             '000a290e4': {'../abc/def/xyz//000a290e4-1.jpg',
              '../abc/def/xyz//000a290e4-2.jpg'},
             '000fb9572': {'../abc/def/xyz//000fb9572-1.jpg',
              '../abc/def/xyz//000fb9572-2.jpg',
              '../abc/def/xyz//000fb9572-3.jpg',
              '../abc/def/xyz//000fb9572-4.jpg'}})

I'm not sure whether the first item in your sample data contains a typo: it has /xyz/ with a single slash, unlike the other items, so the pattern above skips it. If the single slash is intended, adjust the pattern, for example by using xyz/+ instead of xyz// in pat.
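
A pattern that accepts one or more slashes before the file name handles both forms. This is only a minimal sketch, assuming every ID is followed by '-<number>.jpg'; ID_PATTERN and grouped are names chosen here for illustration, the sample list is shortened, and a list is used instead of a set so the original order is kept:

```python
import re
from collections import defaultdict

# Assumption: the ID is the text between the last run of slashes
# and the trailing "-<number>.jpg".
ID_PATTERN = re.compile(r'/+([^/]+)-\d+\.jpg$')

sample = [
    '../abc/def/xyz/0008c5398-1.jpg',    # single slash
    '../abc/def/xyz//0008c5398-2.jpg',   # double slash
    '../abc/def/xyz//000a290e4-1.jpg',
]

grouped = defaultdict(list)  # a list preserves insertion order, a set does not
for url in sample:
    match = ID_PATTERN.search(url)
    if match:
        grouped[match.group(1)].append(url)
```

With this pattern the single-slash URL lands in the same group as the double-slash ones.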

Upvotes: 1

blhsing

Reputation: 107124

You can keep appending the URLs to a dict of lists using dict.setdefault to initialize each new key with a list (assuming your list of URLs is stored as l):

d = {}
for i in l:
    d.setdefault(i.split('/')[-1].split('-')[0], []).append(i)

d becomes:

{'0008c5398': ['../abc/def/xyz/0008c5398-1.jpg',
               '../abc/def/xyz//0008c5398-2.jpg',
               '../abc/def/xyz//0008c5398-3.jpg',
               '../abc/def/xyz//0008c5398-4.jpg',
               '../abc/def/xyz//0008c5398-5.jpg'],
 '000a290e4': ['../abc/def/xyz//000a290e4-1.jpg',
               '../abc/def/xyz//000a290e4-2.jpg'],
 '000fb9572': ['../abc/def/xyz//000fb9572-1.jpg',
               '../abc/def/xyz//000fb9572-2.jpg',
               '../abc/def/xyz//000fb9572-3.jpg',
               '../abc/def/xyz//000fb9572-4.jpg']}
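
The one-liner works because dict.setdefault returns the value stored under the key, inserting the default first if the key is missing. A minimal illustration with made-up file names (groups, first and second are hypothetical names used only here):

```python
groups = {}

# Key is missing: setdefault stores the empty list and returns it.
first = groups.setdefault('0008c5398', [])
first.append('0008c5398-1.jpg')

# Key is present: setdefault ignores the default and returns the stored list.
second = groups.setdefault('0008c5398', [])
second.append('0008c5398-2.jpg')

print(first is second)  # both names refer to the same list object
print(groups)
```

Because the returned object is the list held by the dict, calling append on it updates the dict entry in place.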

Upvotes: 1

shikai ng

Reputation: 135

Alternatively, you could do a simple split and take the last item of each URL to get the image name, then split the name again to get the image ID.

After that, check whether the image ID already exists in the result dictionary and either append the URL to the existing entry or create a new entry.

inputURLs = ['../abc/def/xyz/0008c5398-1.jpg',
             '../abc/def/xyz//0008c5398-2.jpg',
             '../abc/def/xyz//0008c5398-3.jpg',
             '../abc/def/xyz//0008c5398-4.jpg',
             '../abc/def/xyz//0008c5398-5.jpg',
             '../abc/def/xyz//000a290e4-1.jpg',
             '../abc/def/xyz//000a290e4-2.jpg',
             '../abc/def/xyz//000fb9572-1.jpg',
             '../abc/def/xyz//000fb9572-2.jpg',
             '../abc/def/xyz//000fb9572-3.jpg',
             '../abc/def/xyz//000fb9572-4.jpg']

resultDict = {}

for inputUrl in inputURLs:
    imageName = inputUrl.split('/')[-1]
    imageId = imageName.split('-')[0]
    if imageId in resultDict:
        resultDict[imageId].append(inputUrl)
    else:
        resultDict[imageId] = [inputUrl]

Upvotes: 1

Ajax1234

Reputation: 71471

You can use itertools.groupby:

import re
from itertools import groupby
d = ['../abc/def/xyz/0008c5398-1.jpg', '../abc/def/xyz//0008c5398-2.jpg', '../abc/def/xyz//0008c5398-3.jpg', '../abc/def/xyz//0008c5398-4.jpg', '../abc/def/xyz//0008c5398-5.jpg', '../abc/def/xyz//000a290e4-1.jpg', '../abc/def/xyz//000a290e4-2.jpg', '../abc/def/xyz//000fb9572-1.jpg', '../abc/def/xyz//000fb9572-2.jpg', '../abc/def/xyz//000fb9572-3.jpg', '../abc/def/xyz//000fb9572-4.jpg']
_d = [[re.findall(r'\w+(?=-\d)', i)[0], i] for i in d]
result = {a:[c for _, c in b] for a,b in groupby(sorted(_d, key=lambda x:x[0]), key=lambda x:x[0])}

Output:

{
 "0008c5398": [
    "../abc/def/xyz/0008c5398-1.jpg",
    "../abc/def/xyz//0008c5398-2.jpg",
    "../abc/def/xyz//0008c5398-3.jpg",
    "../abc/def/xyz//0008c5398-4.jpg",
    "../abc/def/xyz//0008c5398-5.jpg"
 ],
 "000a290e4": [
    "../abc/def/xyz//000a290e4-1.jpg",
    "../abc/def/xyz//000a290e4-2.jpg"
 ],
 "000fb9572": [
    "../abc/def/xyz//000fb9572-1.jpg",
    "../abc/def/xyz//000fb9572-2.jpg",
    "../abc/def/xyz//000fb9572-3.jpg",
    "../abc/def/xyz//000fb9572-4.jpg"
   ]
}
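
The sorted call matters: groupby only merges adjacent items with equal keys, so unsorted input can yield the same key more than once. A small sketch with made-up URLs, where image_id is a hypothetical helper that takes the file name up to the first '-':

```python
from itertools import groupby

def image_id(url):
    # Hypothetical helper: file name up to the first '-'
    return url.rsplit('/', 1)[-1].split('-', 1)[0]

urls = ['x/aaa-1.jpg', 'x/bbb-1.jpg', 'x/aaa-2.jpg']

# Without sorting, 'aaa' appears as two separate groups.
unsorted_groups = [(key, list(group)) for key, group in groupby(urls, key=image_id)]

# Sorting by the same key first puts equal IDs next to each other.
sorted_groups = {key: list(group)
                 for key, group in groupby(sorted(urls, key=image_id), key=image_id)}
```

Since sorted is stable, URLs that share an ID keep their original relative order inside each group.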

Upvotes: 1

Grismar

Reputation: 31416

Look into regular expressions. One approach is to match each URL against a regex and store the results in a dictionary that uses a capture group from the match as the key and collects the matching URLs in the value:

import re


urls = ['../abc/def/xyz/0008c5398-1.jpg',
        '../abc/def/xyz//0008c5398-2.jpg',
        '../abc/def/xyz//0008c5398-3.jpg',
        '../abc/def/xyz//0008c5398-4.jpg',
        '../abc/def/xyz//0008c5398-5.jpg',
        '../abc/def/xyz//000a290e4-1.jpg',
        '../abc/def/xyz//000a290e4-2.jpg',
        '../abc/def/xyz//000fb9572-1.jpg',
        '../abc/def/xyz//000fb9572-2.jpg',
        '../abc/def/xyz//000fb9572-3.jpg',
        '../abc/def/xyz//000fb9572-4.jpg']


result = {}

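# /+ accepts one or more slashes, so the first URL (single slash) matches too;
# the dot before "jpg" is escaped so it only matches a literal dot
rgx = re.compile(r"\.\./abc/def/xyz/+(.*)-\d+\.jpg")
for url in urls:
    match = rgx.search(url)
    if match:
        key = match.group(1)
        if key not in result:
            result[key] = []
        result[key].append(url)
    else:
        print(f'This did not match: {url}')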

Upvotes: 1
