Reputation: 11
I have a task of pulling images of items based on SKU and writing them to an Excel sheet. I can download each image and write it out fine, but when workbook.close() is called, xlsxwriter only writes the last image. This is because I save space by overwriting the cached image after each write. Here is my write function:
def writeExcel(url, asin, imgLink, number):
    if url == -1:  # in case the image can't be retrieved
        worksheet.write("A{}".format(number), asin)
        worksheet.write("C{}".format(number), "N/A")
        return
    worksheet.write_string("A{}".format(number), asin)
    imgPath = os.getcwd() + "/cache/img.jpg"
    deleteCache()  # remove the previously downloaded image to make room for the new one
    getImage(imgLink)  # download the image into ./cache/img.jpg
    fixImage(imgPath)  # fix the aspect ratio of the image so it fits into the cell
    worksheet.insert_image("C{}".format(number), imgPath, {
        "y_scale": 0.2,
        "x_scale": 0.5,
        "object_position": 1,
        "url": url
    })
It takes the SKU of the item and the image link. It then calls getImage(), which downloads the image into ./cache/img.jpg, fixes the aspect ratio with fixImage(), and finally writes the image to the worksheet.
This function is called once per SKU from a for loop in another function. Here is that function for reference:
def amazonSearch(asinList):
    number = 0
    for asin in asinList:
        number += 1
        if number % 25 == 0:  # feedback to make sure it isn't stuck
            print("Finished {}. Currently at {}".format(number, asin))
        for region in regions:
            req = requests.get(HOST.format(region, asin))
            counter = 0
            while req.status_code == 503:
                req = requests.get(HOST.format(region, asin))
                time.sleep(1)  # don't spam
                counter += 1
                if counter >= 25:
                    break
            if req.status_code == 200:
                break
        if req.status_code != 200:
            # pass the row number so the "N/A" row gets a valid cell reference
            writeExcel(-1, asin, "", number)
            continue
        soup = bs(req.content, "html.parser")
        imgTag = soup.find_all(id="landingImage")
        imgLink = imgTag[0]["src"]
        writeExcel(req.url, asin, imgLink, number)
After the script finishes, the file is written, but the last SKU's image shows up for every other SKU as well. This is probably because xlsxwriter only writes changes when workbook.close() is called.
My question is: how can I fix that without having to save every single image and write them all at the end? The input file is pretty big (over 8k items). I have thought of closing and reopening the sheet every time writeExcel() is called, but that isn't feasible: xlsxwriter overwrites the file every time, so it can't be done.
Upvotes: 0
Views: 387
Reputation: 9377
insert_image only adds the image path or URL to a buffer. Later, when the workbook is closed or saved, the images are loaded from those paths (all the same in your case) and written to the output.
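To make the effect concrete, here is a toy model of that deferred read. DeferredSheet and demo_deferred_read are hypothetical names invented for illustration; this is not xlsxwriter's actual code, only a sketch of "store the path now, read the file at close" under the assumption that a single cache file is overwritten between inserts.

```python
import os
import tempfile


class DeferredSheet:
    """Toy stand-in (not xlsxwriter): records paths, reads files only on close."""

    def __init__(self):
        self.pending = []  # (cell, path) pairs; no image bytes are stored yet

    def insert_image(self, cell, path):
        self.pending.append((cell, path))

    def close(self):
        # Files are opened only now, so every entry sees the file's final content.
        result = {}
        for cell, path in self.pending:
            with open(path, "rb") as f:
                result[cell] = f.read()
        return result


def demo_deferred_read():
    cache = os.path.join(tempfile.mkdtemp(), "img.jpg")
    sheet = DeferredSheet()
    for cell, payload in [("C1", b"first image"), ("C2", b"second image")]:
        with open(cache, "wb") as f:
            f.write(payload)  # overwrite the single cache file
        sheet.insert_image(cell, cache)
    return sheet.close()
```

Both cells end up with the bytes of the second image, which matches the symptom in the question.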
You can fix this by reading the image binary yourself and inserting it using image_data:
from io import BytesIO

# Read the image bytes now, while the cached file still holds this image
with open(filename, 'rb') as image_file:
    image_data = BytesIO(image_file.read())

# Write the byte stream image to a cell. The filename must still be specified
worksheet.insert_image('B8', filename, {'image_data': image_data})
Note: when image_data is present, the file at the path or URL given in the filename argument does not need to exist, so you can treat the filename argument as an identifier or URI.
Since you are reading from the same cached file, the filename passed to insert_image can be made unique by using some distinctive attribute. For example:
    filename_to_insert = asin + filename
or
    filename_to_insert = url
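Applied to the loop in the question, the idea could look roughly like the sketch below. insert_image_from_cache, RecordingSheet and demo_snapshot are hypothetical names introduced here; RecordingSheet is only a stand-in so the snippet runs without xlsxwriter, and in real use you would pass the actual worksheet. The only xlsxwriter call assumed is insert_image with the image_data option shown above.

```python
from io import BytesIO
import os
import tempfile


def insert_image_from_cache(worksheet, cell, img_path, name, options=None):
    """Read img_path immediately and pass the bytes via the image_data option."""
    with open(img_path, "rb") as f:
        image_data = BytesIO(f.read())  # snapshot taken before the cache is overwritten
    opts = dict(options or {})  # copy so the caller's dict is not modified
    opts["image_data"] = image_data
    worksheet.insert_image(cell, name, opts)
    return image_data


class RecordingSheet:
    """Hypothetical stand-in that just records insert_image calls."""

    def __init__(self):
        self.calls = []

    def insert_image(self, cell, name, options):
        self.calls.append((cell, name, options))


def demo_snapshot():
    cache = os.path.join(tempfile.mkdtemp(), "img.jpg")
    sheet = RecordingSheet()
    items = [("ASIN-1", b"first"), ("ASIN-2", b"second")]  # made-up sample data
    for row, (asin, payload) in enumerate(items, start=1):
        with open(cache, "wb") as f:
            f.write(payload)  # same cache file reused for every item
        insert_image_from_cache(sheet, "C{}".format(row), cache, asin + ".jpg",
                                {"x_scale": 0.5, "y_scale": 0.2})
    return [(c, n, o["image_data"].getvalue()) for c, n, o in sheet.calls]
```

In writeExcel, the existing worksheet.insert_image(...) call would be replaced by a call to insert_image_from_cache with the same cell and options. Note that each image's bytes then stay in memory until workbook.close(), which may matter for a run of around 8k items.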
Upvotes: 1