Reputation: 690
I am trying to label multiple images, organized as brand -> product -> product images. Since labeling each image one at a time is slow, I decided to use multiprocessing to speed up the job. It definitely speeds up the labeling, but the code doesn't work the way I intended.
Code:
import json
import multiprocessing

def multiprocessing_func(line):
    # Label one product (one JSON line) unless its saved URLs are unchanged.
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])
    try:
        active_urls.remove(brand_dic[brand])
    except:
        pass
    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')

active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'
brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'
        print(new_images_filename)
        # Truncate the output file for this brand.
        with open(new_images_filename, 'w'): pass
        # Load the previously saved URLs per product.
        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)
        print(saved_product_dict)
        keep_products = []
        labels_list = []
        # Start one process per input line.
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()
        print('complete stage 1')
        for i in range(0,2):
            print('running stage 2')
Output:
data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False
I noticed that the multiprocessing step runs last and skips code, and I'm not sure why. I'm also not sure why the first part didn't run: when I tried printing "saved_product_dict", the dictionary came up empty.
I have code before and after the multiprocessing step, and all of it runs before the multiprocessing step does. My question is: how do I force the multiprocessing step to run in the order in which I have written my code? Any explanation of what's going on would be greatly appreciated. I'm new to multiprocessing and still learning how it works.
Upvotes: 0
Views: 33
Reputation: 151
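This line seems to be wrong. If json_line['urls'] is a flat list of URL strings (which is how multiprocessing_func treats it), the nested comprehension iterates over the characters of each URL instead of the URLs themselves. Try changing
saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
to:
saved_urls = [url for url in json_line['urls']]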
This might be the solution for the first part of your question.
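As a small illustration of the difference, here is a sketch that assumes 'urls' is a flat list of strings. The sample record is made up for the example and is not taken from the asker's data.

```python
import json

# Hypothetical record; assumes 'urls' is a flat list of URL strings.
line = '{"groupid": "p1", "urls": ["https://a/1.jpg", "https://a/2.jpg"]}'
json_line = json.loads(line)

# Nested comprehension: the inner loop walks each string,
# so the result is a list of single characters.
nested = [url for urls_list in json_line['urls'] for url in urls_list]

# Flat comprehension: the result is the URL strings themselves.
flat = [url for url in json_line['urls']]

print(set(nested) == set(json_line['urls']))  # False
print(set(flat) == set(json_line['urls']))    # True
```

If the saved file really stores 'urls' as a list of lists, the original nested form would be the right one, so it is worth checking one line of that file first.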
About the print order of the multiprocessing part versus the main program: in a concurrent setting (here, several separate processes), the order of printed output is not always a reliable indicator of when each function actually ran. The main process starts the workers and immediately continues, so "complete stage 1" and "running stage 2" can be printed before any worker has finished. If you want things to run in a defined order, you either need a synchronization mechanism such as semaphores or mutexes, or you wait for all processes to exit (by calling join() on each one) before moving to stage 2, which I assume was your main concern.
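A minimal sketch of the "wait for all processes" option, using only Process.start() and Process.join() from the standard library. The names label_one and run_stage_1 are hypothetical stand-ins, not the asker's functions.

```python
import multiprocessing


def label_one(line):
    # Hypothetical stand-in for the real per-line labeling work.
    print('labeled', line.strip())


def run_stage_1(lines):
    processes = []
    for line in lines:
        p = multiprocessing.Process(target=label_one, args=(line,))
        processes.append(p)
        p.start()
    # join() blocks until the given process has exited, so nothing
    # after this loop runs while a worker is still busy.
    for p in processes:
        p.join()
    # After join(), exitcode is set; 0 means the worker ended normally.
    return [p.exitcode for p in processes]


if __name__ == '__main__':
    exit_codes = run_stage_1(['a\n', 'b\n', 'c\n'])
    print('complete stage 1', exit_codes)
    print('running stage 2')
```

Note that each child process works on its own copy of module-level objects, so appending to a plain list such as keep_products inside a worker does not change the list in the parent. Results have to be passed back explicitly, for example through a multiprocessing.Queue or by using multiprocessing.Pool.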
Upvotes: 1