Reputation: 357
I have a list of URLs that follow a similar pattern, like this:
['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg',
'../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg',
'../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']
The first part, '../abc/def/xyz/', is common to all URLs. I want to group the links that share the same ID into a dict, something like this:
{"0008c5398": ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg'],
"000a290e4": [ '../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg'],
"000fb9572": [ '../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']
}
Any hints? Many thanks in advance...
Upvotes: 0
Views: 124
Reputation: 2395
Here's a simple solution of just iterating over the list and appending to a dictionary.
import os
import pprint
a = ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg',
'../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg',
'../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']
url_dict = {}
for url in a:
    # the ID is the part of the file name before the first hyphen
    image_id = os.path.split(url)[-1].split('-')[0]
    if image_id not in url_dict:
        url_dict[image_id] = [url]
    else:
        url_dict[image_id].append(url)
pprint.pprint(url_dict)
Output:
{'0008c5398': ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg'],
'000a290e4': ['../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg'],
'000fb9572': ['../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']}
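The same loop can be written a little more compactly with collections.defaultdict, which creates the empty list the first time a key is accessed. This is only a minimal sketch, assuming the ID is everything before the first hyphen in the file name; the function name group_by_id and the shortened URL list are made up for illustration:

```python
import os
from collections import defaultdict


def group_by_id(urls):
    """Group URLs by the part of the file name before the first hyphen."""
    groups = defaultdict(list)
    for url in urls:
        # os.path.basename drops the directory part, even with a doubled slash
        file_name = os.path.basename(url)
        image_id = file_name.split('-')[0]
        groups[image_id].append(url)
    return dict(groups)  # convert back to a plain dict


urls = ['../abc/def/xyz/0008c5398-1.jpg',
        '../abc/def/xyz//0008c5398-2.jpg',
        '../abc/def/xyz//000a290e4-1.jpg']
grouped = group_by_id(urls)
```

Returning a plain dict avoids the defaultdict side effect of silently creating keys on later lookups.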
Upvotes: 1
Reputation: 1781
First, you need a way to extract the key from each item. You can use re.
In [106]: import re
In [107]: pat = r'.*?xyz//(.*)-.*'
In [108]: match = re.search(pat, '../abc/def/xyz//0008c5398-4.jpg')
In [109]: match.group(1)
Out[109]: '0008c5398'
Then you need a loop that checks every item and extracts the key in the same way. To keep it simple, you can use defaultdict.
In [110]: from collections import defaultdict
In [111]: d = defaultdict(set)
In [119]: for i in sample:
     ...:     pat = r'.*?xyz//(.*)-.*'
     ...:     match = re.search(pat, i)
     ...:     if not match:
     ...:         continue
     ...:     key = match.group(1)
     ...:     d[key].add(i)
     ...:
In [120]: d
Out[120]:
defaultdict(set,
{'0008c5398': {'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg'},
'000a290e4': {'../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg'},
'000fb9572': {'../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg'}})
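I'm not sure whether the first item of your sample data contains a typo: it has /def/xyz/ (one slash) where the other items have /def/xyz// (two slashes), which is why it is missing from the output above. If the difference is intentional, adjust the slashes in pat as needed; for example, xyz/+ accepts one or more slashes.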
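If both the single-slash and double-slash forms should land in the same group, one option is to let the pattern accept one or more slashes. The following is a sketch rather than a drop-in replacement: it assumes each file name contains exactly one hyphen, it uses defaultdict(list) instead of set so the original order is kept, and the shortened sample list is illustrative:

```python
import re
from collections import defaultdict

# "/+" accepts one or more slashes after "xyz";
# the greedy group then runs up to the last "-" in the string
pat = re.compile(r'xyz/+(.*)-')

sample = ['../abc/def/xyz/0008c5398-1.jpg',
          '../abc/def/xyz//0008c5398-2.jpg',
          '../abc/def/xyz//000a290e4-1.jpg']

d = defaultdict(list)
for item in sample:
    match = pat.search(item)
    if match:
        d[match.group(1)].append(item)
```

Compiling the pattern once outside the loop also avoids rebuilding it for every item.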
Upvotes: 1
Reputation: 107124
You can keep appending the URLs to a dict of lists using dict.setdefault to initialize each new key with a list (assuming your list of URLs is stored as l):
d = {}
for i in l:
    d.setdefault(i.split('/')[-1].split('-')[0], []).append(i)
d becomes:
{'0008c5398': ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg'],
'000a290e4': ['../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg'],
'000fb9572': ['../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']}
Upvotes: 1
Reputation: 135
Alternatively, you could do a simple split and take the last item of each URL to get the image name, then split the name again to get the image ID.
After that, check whether the image ID already exists in your result dictionary and either append the URL to the existing entry or create a new one.
inputURLs = ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg',
'../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg',
'../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']
resultDict = {}
for inputUrl in inputURLs:
    imageName = inputUrl.split('/')[-1]
    imageId = imageName.split('-')[0]
    if imageId in resultDict:
        resultDict[imageId].append(inputUrl)
    else:
        resultDict[imageId] = [inputUrl]
Upvotes: 1
Reputation: 71471
You can use itertools.groupby:
import re
from itertools import groupby
d = ['../abc/def/xyz/0008c5398-1.jpg', '../abc/def/xyz//0008c5398-2.jpg', '../abc/def/xyz//0008c5398-3.jpg', '../abc/def/xyz//0008c5398-4.jpg', '../abc/def/xyz//0008c5398-5.jpg', '../abc/def/xyz//000a290e4-1.jpg', '../abc/def/xyz//000a290e4-2.jpg', '../abc/def/xyz//000fb9572-1.jpg', '../abc/def/xyz//000fb9572-2.jpg', '../abc/def/xyz//000fb9572-3.jpg', '../abc/def/xyz//000fb9572-4.jpg']
_d = [[re.findall(r'\w+(?=-\d)', i)[0], i] for i in d]
result = {a:[c for _, c in b] for a,b in groupby(sorted(_d, key=lambda x:x[0]), key=lambda x:x[0])}
Output:
{
"0008c5398": [
"../abc/def/xyz/0008c5398-1.jpg",
"../abc/def/xyz//0008c5398-2.jpg",
"../abc/def/xyz//0008c5398-3.jpg",
"../abc/def/xyz//0008c5398-4.jpg",
"../abc/def/xyz//0008c5398-5.jpg"
],
"000a290e4": [
"../abc/def/xyz//000a290e4-1.jpg",
"../abc/def/xyz//000a290e4-2.jpg"
],
"000fb9572": [
"../abc/def/xyz//000fb9572-1.jpg",
"../abc/def/xyz//000fb9572-2.jpg",
"../abc/def/xyz//000fb9572-3.jpg",
"../abc/def/xyz//000fb9572-4.jpg"
]
}
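Note that itertools.groupby only merges adjacent items with equal keys, which is why the list is sorted before grouping. A small sketch of what happens otherwise, using shortened, made-up file names and a hypothetical helper called key:

```python
from itertools import groupby

names = ['a-1.jpg', 'b-1.jpg', 'a-2.jpg']  # made-up, deliberately not sorted by ID


def key(name):
    # the ID is the part before the first hyphen
    return name.split('-')[0]


# Without sorting, 'a' shows up as two separate runs, and the dict
# comprehension keeps only the last run for each key.
unsorted_result = {k: list(g) for k, g in groupby(names, key=key)}

# Sorting by the same key first puts equal IDs next to each other.
sorted_result = {k: list(g) for k, g in groupby(sorted(names, key=key), key=key)}
```

Here unsorted_result silently loses 'a-1.jpg', while sorted_result keeps both 'a' entries together.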
Upvotes: 1
Reputation: 31416
Look into regular expressions. One approach is to match each URL against a regex and store the results in a dictionary, using a numbered group from the match as the key and appending the URL to that key's list:
import re
urls = ['../abc/def/xyz/0008c5398-1.jpg',
'../abc/def/xyz//0008c5398-2.jpg',
'../abc/def/xyz//0008c5398-3.jpg',
'../abc/def/xyz//0008c5398-4.jpg',
'../abc/def/xyz//0008c5398-5.jpg',
'../abc/def/xyz//000a290e4-1.jpg',
'../abc/def/xyz//000a290e4-2.jpg',
'../abc/def/xyz//000fb9572-1.jpg',
'../abc/def/xyz//000fb9572-2.jpg',
'../abc/def/xyz//000fb9572-3.jpg',
'../abc/def/xyz//000fb9572-4.jpg']
result = {}
# "/+" accepts one or two slashes before the file name; the dot before "jpg" is escaped
rgx = re.compile(r"\.\./abc/def/xyz/+(.*)-\d+\.jpg")
for url in urls:
    match = rgx.search(url)
    if match:
        key = match.group(1)
        if key not in result:
            result[key] = []
        result[key].append(url)
    else:
        print(f'This did not match: {url}')
Upvotes: 1