jayjunior
jayjunior

Reputation: 145

Python to extract the first sheet from all excel files in a folder

I have some files in a folder. most but not all in excel format. The files names are all like this:

doe_jane1234_question_12345_123456_JaneDoe.xlsx doe_john6578_question_56789_567890_JohnDoe.xlsx

Using Python, I'd like to open each excel file, grab the first sheet, and paste it into a new excel file and then rename the newly created sheets in the new file to

doe_jane

doe_john

The new Excel file will contain multiple sheets, each corresponding to a file the sheet was copied from.

This is the code I have come up with so far: first I create an empty excel file called output.xlsx in my directory.

import pandas as pd
import os
import re

cwd = os.path.join("C:\\directory")
 
files = os.listdir(cwd) 
outputpath = os.path.join(cwd, 'output.xlsx') 

for i in files:
    # sheet name is the string before the firs number
    sheetname = re.search(r'^[^\d]*', i)    
    input_file_path = os.path.join(cwd,i)
    df_in = pd.read_excel(input_file_path ) 
    with pd.ExcelWriter(outputpath , engine="openpyxl", mode='a') as writer:
        df_in.to_excel(writer, sheet_name=sheetname)
   
    

when I run this I get an error:

  File "C:\Anaconda3\lib\site-packages\openpyxl\workbook\child.py", line 90, in title
    m = INVALID_TITLE_REGEX.search(value)

TypeError: expected string or bytes-like object

Upvotes: 1

Views: 2289

Answers (1)

There's a simple issue in your code which prevents it to work. The issue is that you're not taking a string as output from re.search. Instead, it returns an object from class re.Match (which cannot be interpreted as sheet name). See:

re_output = re.search(r'^[^\d]*', 'doe_john6578_question_56789_567890_JohnDoe.xlsx')
print(re_output)

# <re.Match object; span=(0, 8), match='doe_john'>

We need to use .group() to get the match:

print(re_output.group())

# doe_john

So, changing the line from sheetname = re.search(r'^[^\d]*', i) to sheetname = re.search(r'^[^\d]*', i).group() should work.


That said, once you're creating an empty excel file (with an empty sheet) and it is in the same folder you're getting the first sheet of excel files, you'll end up with two empty sheets. You can avoid that by creating the file just as needed. Take a look a the code below:

import pandas as pd
import os
import re

cwd = os.path.join("C:\\directory")
 
files = os.listdir(cwd) 
outputpath = os.path.join(cwd, 'output.xlsx')

with pd.ExcelWriter(outputpath) as writer:
    for file in files:
        sheetname = re.search(r'^[^\d]*', file).group()
        input_file_path = os.path.join(cwd, file)
        pd.read_excel(input_file_path).to_excel(writer, sheet_name=sheetname) 

Update:

Based on new OP requirements (to keep the formulas), we can leave pandas and use openpyxl directly. See this answer for possible copying solutions. Also, if you want to include the styles as well, take a look at this other answer.

The modified code could be:

# Import libraries
import os, re
from openpyxl import Workbook, load_workbook

# Set paths
cwd = os.path.join("C:\\directory")
outputpath = os.path.join(cwd, 'output.xlsx')

# Get file names
files = os.listdir(cwd)

# Create a new empty workbook
wb = Workbook()

# Delete the first empty sheet (created by default when calling `Workbook()`)
del wb[wb.sheetnames[0]]

# Then, for each file
for file in files:
    
    # Get sheet name based on file name
    sheetname = re.search(r'^[^\d]*', file).group()
    
    # Get full path to the file
    input_file_path = os.path.join(cwd, file)
    
    # Read file and get the first sheet of the file
    ws = load_workbook(input_file_path).worksheets[0]
    
    # Create a new empty sheet to the output file
    wb.create_sheet(sheetname)

    # Add content to the new sheet
    for row in ws:
        for cell in row:
            wb[sheetname][cell.coordinate].value = cell.value

# Save file
wb.save(outputpath)

Upvotes: 3

Related Questions