Reputation: 493
I have an Excel sheet with 100 rows. Each one has various informations, including an id, and a cell containing a photo.
I use pandas to load the data into dictionaries :
import pandas as pd
df = pd.read_excel('myfile.xlsx')
data = []
for index,row in df.iterrows():
data.append({
'id':row['id'],
'field2':row['field2'],
'field3':row['field3']
})
For the image column, I want to extract each image, name it with the id of the row (image_row['id'].jpg) and put it into a folder. Then, I want to store the path to the image as below :
for index,row in df.iterrows():
data.append({
'id':row['id'],
'field2':row['field2'],
'field3':row['field3'],
'image':'path/image_'+row['id']+'.jpg'
})
I'm looking for a way to do that, or another way if better. Do you have any idea ?
I'm on Linux, so I can't use this method with pywin32.
Thanks a lot
-- EDIT
You can find here an exemple of sheet i use
Upvotes: 8
Views: 35628
Reputation: 87
def extract_images_from_excel(path, dir_extract=None):
"""extracts images from excel and names then with enumerated filename
Args:
path: pathlib.Path, excel filepath
dir_extract: pathlib.Path, default=None, defaults to same dir as excel file
Returns:
new_paths: list[pathlib.Path], list of paths to the extracted images
"""
if type(path) is str:
path = pathlib.Path(path)
if dir_extract is None:
dir_extract = path.parent
if path.suffix != '.xlsx':
raise ValueError('path must be an xlsx file')
name = path.name.replace(''.join(path.suffixes), '').replace(' ', '') # name of excel file without suffixes
temp_file = pathlib.Path(source_file).parent / 'temp.xlsx' # temp xlsx
temp_zip = temp_file.with_suffix('.zip') # temp zip
shutil.copyfile(source_file, temp_file)
temp_file.rename(str(temp_zip))
extract_dir = temp_file.parent / 'temp'
extract_dir.mkdir(exist_ok=True)
shutil.unpack_archive(temp_zip, extract_dir) # unzip xlsx zip file
paths_img = sorted((extract_dir / 'xl' / 'media').glob('*.png')) # find images
move_paths = {path: destination_dir / (name + f'-{str(n)}.png') for n, path in enumerate(paths_img)} # create move path dict
new_paths = [shutil.move(old, new) for old, new in move_paths.items()] # move / rename image files
shutil.rmtree(extract_dir) # delete temp folder
temp_zip.unlink() # delete temp file
return new_paths
the above ^ does the following:
doesn't require 3rd party packages and doesn't need windows to run
Upvotes: 7
Reputation: 27466
You can unzip the renamed xlsx zip file.
$ cp a.xlsx a.zip
$ unzip a.zip
$ ls -al xl/media
Upvotes: 2
Reputation: 493
I found a solution using openpyxl and openpyxl-image-loader modules
# installing the modules
pip3 install openpyxl
pip3 install openpyxl-image-loader
Then, in the script :
#Importing the modules
import openpyxl
from openpyxl_image_loader import SheetImageLoader
#loading the Excel File and the sheet
pxl_doc = openpyxl.load_workbook('myfile.xlsx')
sheet = pxl_doc['Sheet_name']
#calling the image_loader
image_loader = SheetImageLoader(sheet)
#get the image (put the cell you need instead of 'A1')
image = image_loader.get('A1')
#showing the image
image.show()
#saving the image
image.save('my_path/image_name.jpg')
In the end, I can store the path and the image name in my dictionaries in a loop for each row
Upvotes: 31
Reputation: 207485
There may be far better solutions, but I thought I would share what I know in case that is good enough.
An Excel .xlsx
file is a actually a zip-file. So you can read it with 7z and probably also with Python Zipfile. Just demonstrating in Terminal:
# List contents
7z l a.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,12 CPUs x64)
Scanning the drive for archives:
1 file, 596240 bytes (583 KiB)
Listing archive: a.xlsx
--
Path = a.xlsx
Type = zip
Physical Size = 596240
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2020-05-27 02:36:54 ..... 2371 563 xl/drawings/drawing1.xml
2020-05-27 02:36:54 ..... 561 198 xl/drawings/_rels/drawing1.xml.rels
2020-05-27 02:36:54 ..... 1781 565 xl/worksheets/sheet1.xml
2020-05-27 02:36:54 ..... 298 179 xl/worksheets/_rels/sheet1.xml.rels
2020-05-27 02:36:54 ..... 3757 808 xl/theme/theme1.xml
2020-05-27 02:36:54 ..... 427 204 xl/sharedStrings.xml
2020-05-27 02:36:54 ..... 2523 613 xl/styles.xml
2020-05-27 02:36:54 ..... 809 330 xl/workbook.xml
2020-05-27 02:36:54 ..... 697 234 xl/_rels/workbook.xml.rels
2020-05-27 02:36:54 ..... 296 178 _rels/.rels
2020-05-27 02:36:54 ..... 156683 156657 xl/media/image2.png
2020-05-27 02:36:54 ..... 46848 46853 xl/media/image1.png
2020-05-27 02:36:54 ..... 386512 386632 xl/media/image3.png
2020-05-27 02:36:54 ..... 1099 320 [Content_Types].xml
------------------- ----- ------------ ------------ ------------------------
2020-05-27 02:36:54 604662 594334 14 files
You can then extract the files and look at the images with:
7z x a.xlsx
Another option might be to save the Excel file as a PDF, you can then run pdfimages
from the Poppler package and extract the images:
pdfimages -png YourSpreadsheet.pdf extracted
Sample Output
-rw-r--r--@ 1 mark staff 92973 27 May 10:57 extracted-000.png
-rw-r--r--@ 1 mark staff 28074 27 May 10:57 extracted-001.png
-rw-r--r--@ 1 mark staff 189 27 May 10:57 extracted-002.png
-rw-r--r--@ 1 mark staff 244898 27 May 10:57 extracted-003.png
Upvotes: 2