Reputation: 145
I have some files in a folder. most but not all in excel format. The files names are all like this:
doe_jane1234_question_12345_123456_JaneDoe.xlsx doe_john6578_question_56789_567890_JohnDoe.xlsx
Using Python, I'd like to open each excel file, grab the first sheet, and paste it into a new excel file and then rename the newly created sheets in the new file to
doe_jane
doe_john
The new Excel file will contain multiple sheets, each corresponding to a file the sheet was copied from.
This is the code I have come up with so far: first I create an empty excel file called output.xlsx in my directory.
import pandas as pd
import os
import re
cwd = os.path.join("C:\\directory")
files = os.listdir(cwd)
outputpath = os.path.join(cwd, 'output.xlsx')
for i in files:
# sheet name is the string before the firs number
sheetname = re.search(r'^[^\d]*', i)
input_file_path = os.path.join(cwd,i)
df_in = pd.read_excel(input_file_path )
with pd.ExcelWriter(outputpath , engine="openpyxl", mode='a') as writer:
df_in.to_excel(writer, sheet_name=sheetname)
when I run this I get an error:
File "C:\Anaconda3\lib\site-packages\openpyxl\workbook\child.py", line 90, in title
m = INVALID_TITLE_REGEX.search(value)
TypeError: expected string or bytes-like object
Upvotes: 1
Views: 2289
Reputation: 4929
There's a simple issue in your code which prevents it to work. The issue is that you're not taking a string as output from re.search
. Instead, it returns an object from class re.Match
(which cannot be interpreted as sheet name). See:
re_output = re.search(r'^[^\d]*', 'doe_john6578_question_56789_567890_JohnDoe.xlsx')
print(re_output)
# <re.Match object; span=(0, 8), match='doe_john'>
We need to use .group()
to get the match:
print(re_output.group())
# doe_john
So, changing the line from sheetname = re.search(r'^[^\d]*', i)
to sheetname = re.search(r'^[^\d]*', i).group()
should work.
That said, once you're creating an empty excel file (with an empty sheet) and it is in the same folder you're getting the first sheet of excel files, you'll end up with two empty sheets. You can avoid that by creating the file just as needed. Take a look a the code below:
import pandas as pd
import os
import re
cwd = os.path.join("C:\\directory")
files = os.listdir(cwd)
outputpath = os.path.join(cwd, 'output.xlsx')
with pd.ExcelWriter(outputpath) as writer:
for file in files:
sheetname = re.search(r'^[^\d]*', file).group()
input_file_path = os.path.join(cwd, file)
pd.read_excel(input_file_path).to_excel(writer, sheet_name=sheetname)
Update:
Based on new OP requirements (to keep the formulas), we can leave pandas and use openpyxl
directly. See this answer for possible copying solutions. Also, if you want to include the styles as well, take a look at this other answer.
The modified code could be:
# Import libraries
import os, re
from openpyxl import Workbook, load_workbook
# Set paths
cwd = os.path.join("C:\\directory")
outputpath = os.path.join(cwd, 'output.xlsx')
# Get file names
files = os.listdir(cwd)
# Create a new empty workbook
wb = Workbook()
# Delete the first empty sheet (created by default when calling `Workbook()`)
del wb[wb.sheetnames[0]]
# Then, for each file
for file in files:
# Get sheet name based on file name
sheetname = re.search(r'^[^\d]*', file).group()
# Get full path to the file
input_file_path = os.path.join(cwd, file)
# Read file and get the first sheet of the file
ws = load_workbook(input_file_path).worksheets[0]
# Create a new empty sheet to the output file
wb.create_sheet(sheetname)
# Add content to the new sheet
for row in ws:
for cell in row:
wb[sheetname][cell.coordinate].value = cell.value
# Save file
wb.save(outputpath)
Upvotes: 3