Reputation: 775
Data: I have a fairly large excel file with more than 20 columns. Each cell contains comments.
Desired Goal: I am trying to read all the comments from column M named 'Engine' from the first row till the last row.
Desired Output: I want to extract all the comments in column M and save them in a list or pandas data frame.
Below is what I tried after reading others' threads:
# load the worksheet for interation
from win32com.client import Dispatch
xlApp = Dispatch("Excel.Application")
workbook = xlApp.Workbooks.Open('My_large_data_file.xls')
worksheet = workbook.Sheets('Mysheet')
# get the row counts for iteration
from openpyxl import load_workbook
wb = load_workbook('My_large_data_file.xls', read_only=True)
sheet = wb.get_sheet_by_name('Mysheet')
row_count = sheet.max_row
comments = []
# iteration
for i in range(2, row_count + 1): # first row is column names
print(i)
comment = worksheet.Cells(i, 13).Comment.Text() # Column M = #13
comments.append(comment)
However, this method only works for cells whose comments are visiable by default. If a cell's comment is invisible, it is read as a NoneType. Then I get error like this:
Traceback (most recent call last):
File "<ipython-input-64-dead2ed27460>", line 5, in <module>
comment = worksheet.Cells(i, 13).Comment.Text() # Column M = #13
AttributeError: 'NoneType' object has no attribute 'Text'
Problem:
1) How can I set all the cells' comments visible so that I can extract them? I am not sure if it needs to apply some VBA code in python.
2) My current codes are not efficient. Especially I am dealing with 60+ such excel files and each contains 70000+ rows. Any suggestions to improve it?
Thanks in advance!
#####################################
There are several status of comments in excel files :
worksheet.Cells(i, j).Comment.Text()
This method works fine for #2 and #3 cases. But it is not working for #1 hidden without indicator case.
Upvotes: 2
Views: 1112
Reputation: 18221
As mentioned in the comment, I am unable to reproduce the issue you mention regarding hidden comments, so I can not comment on that. However, the approach below may well just solve that issue regardless.
Regarding performance, one thing you could try would be to avoid the overhead of COM altogether, as openpyxl
actually has everything you need.
As such, you could do the following:
from openpyxl import load_workbook
wb = load_workbook('My_large_data_file.xls')
sheet = wb.get_sheet_by_name('Mysheet')
comments = [c.comment.text for c in sheet['M'][1:]]
Performance-wise, this should buy you several orders of magnitude as the following 1000 row comparison suggests:
In [64]: %timeit [c.comment.text for c in sheet['M'][1:1000]]
1.31 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit [worksheet.Cells(i, 13).Comment.Text() for i in range(2, 1000)]
1.7 s ± 330 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The difference here comes from the fact that openpyxl
parses the Excel file directly, where win32com
relies on dispatching everything to the Excel process. By going the openpyxl
route, you of course lose the full power of COM, but you will likely find that dipping into COM only makes sense when it is the only option. Here, besides winning a ton of speed, you also do not have to have an Excel process running alongside your script (and indeed, you do not even need to have Excel installed at all), which has the added benefit of making your scripts much more testable.
Upvotes: 2