Maz

Reputation: 11

xlsxwriter images get overwritten when writing the file

I have a task of pulling images of items based on their SKU and writing them to an Excel sheet. I can download an image and write it out fine. The issue is that when workbook.close() is called, xlsxwriter only writes the last image. This happens because, to save space, I overwrite the cached image file after each write. Here is my write function:

def writeExcel(url, asin, imgLink, number):
    if (url == -1): #in case the image could not be retrieved
        worksheet.write("A{}".format(number), asin)
        worksheet.write("C{}".format(number), "N/A")
        return
    worksheet.write_string("A{}".format(number), asin)
    imgPath = os.getcwd() + "/cache/img.jpg"
    deleteCache() #remove the previous downloaded image to download the new one
    getImage(imgLink) #download the image into ./cache/img.jpg
    fixImage(imgPath) #fix the aspect ratio of image to fit into the cell
    worksheet.insert_image("C{}".format(number), imgPath, {
        "y_scale": 0.2,
        "x_scale": 0.5,
        "object_position": 1,
        "url": url
    })

It takes in the SKU of the item and the image link. It then calls getImage(), which downloads the image into ./cache/img.jpg, and fixes the aspect ratio with fixImage(). Finally, it inserts the image into the worksheet.

This function is called for each SKU inside another function's for loop. Here is that function for reference:

def amazonSearch(asinList):
    number = 0
    for asin in asinList:
        number += 1
        if number % 25 == 0:  #feedback to make sure it isn't stuck
            print("Finished {}. Currently at {}".format(number, asin))
        for region in regions:
            req = requests.get(HOST.format(region, asin))
            counter = 0
            while (req.status_code == 503):
                req = requests.get(HOST.format(region, asin))
                time.sleep(1)  #don't spam
                counter += 1
                if (counter >= 25):
                    break
            if req.status_code == 200:
                break
        if (req.status_code != 200):
            writeExcel(-1, asin, "", number)  #record the failure on this SKU's row
            continue
        soup = bs(req.content, "html.parser")
        imgTag = soup.find_all(id="landingImage")
        imgLink = imgTag[0]["src"]
        writeExcel(req.url, asin, imgLink, number)

After the script finishes, the file is written, but the last SKU's image shows up for every other SKU. This is probably because xlsxwriter only writes the images when workbook.close() is called.

My question is: how can I fix this without having to save every single image and write them all at the end? The input file is pretty big (over 8k items). I have thought of closing and reopening the workbook every time writeExcel() is called, but that isn't feasible, because xlsxwriter overwrites the file each time.

Upvotes: 0

Views: 387

Answers (1)

hc_dev

Reputation: 9377

insert_image() only adds the image path or URL to a buffer. Later, when the workbook is closed and saved, the images are loaded from those paths (all the same in your case) and written to the output.

You can fix this by reading the image's binary data and inserting it using the image_data option:

from io import BytesIO

# Read the whole file into memory so later overwrites of it have no effect.
with open(filename, 'rb') as image_file:
    image_data = BytesIO(image_file.read())

# Write the byte stream image to a cell. The filename must still be specified.
worksheet.insert_image('B8', filename, {'image_data': image_data})

Note: When image_data is present, the file at the path or URL given in the filename argument does not need to exist, so you can treat the filename argument as an identifier or URI.

Since you are reading from the same cached file, your filename passed to insert_image as argument can be made unique by using some distinctive attribute like:

  • asin
  • url

For example: filename_to_insert = asin + filename or filename_to_insert = url
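
Putting this together for your writeExcel(), here is a minimal sketch. The helper names load_image_bytes and insert_cached_image are hypothetical (they are not part of xlsxwriter), and the sketch assumes worksheet is your existing xlsxwriter worksheet and that img_path points at the cached ./cache/img.jpg after fixImage() has run:

```python
import os
from io import BytesIO


def load_image_bytes(img_path):
    """Read the cached image fully into memory (hypothetical helper).

    Once the bytes are in a BytesIO, the file on disk can be deleted or
    overwritten without affecting what gets written to the workbook.
    """
    with open(img_path, "rb") as image_file:
        return BytesIO(image_file.read())


def insert_cached_image(worksheet, number, asin, url, img_path):
    """Insert the image currently stored at img_path into row `number`.

    `worksheet` is assumed to be an xlsxwriter worksheet; the options
    mirror the ones used in the question's writeExcel().
    """
    image_data = load_image_bytes(img_path)

    # The name only acts as an identifier here, so make it distinct per SKU.
    unique_name = "{}_{}".format(asin, os.path.basename(img_path))

    worksheet.insert_image("C{}".format(number), unique_name, {
        "y_scale": 0.2,
        "x_scale": 0.5,
        "object_position": 1,
        "url": url,
        "image_data": image_data,  # in-memory copy taken right now
    })
    return unique_name
```

In writeExcel() you would call insert_cached_image(worksheet, number, asin, url, imgPath) right after fixImage(imgPath), in place of the existing worksheet.insert_image(...) call. The image bytes are then held in memory until workbook.close(), so with over 8k items it may be worth checking memory use, although the images would end up in the workbook either way.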


Upvotes: 1
