Terminator6677
Terminator6677

Reputation: 7

Sort columns from multiple excel sheets

I have a big dataset with 88 Excel sheets in one excel file.

I want to sort the data from column 4 and 6 (not index). Every sheet have the same pattern of columns. The pattern of sheets I want to sort is the following: Sheet 1,2,3,4,...,9,10,11,12,...,17,18,19,20.... and so on (the last one's would be sheets 81,82,83 and 84).

The column "4" contains names and should be sorted alphabetical and the column 6 contains numbers and should be sorted in ascending order. The first row of every sheet contains the column name, this should not be sorted of course.

How can I do that with Python? I don't know how to deal with that problem after I load the data..

df = pd.read_excel("dataset.xlsx", na_values=['NAN'])

Upvotes: 0

Views: 92

Answers (1)

jrynes
jrynes

Reputation: 185

You could try something like below:

# Import statement
import pandas as pd

# Generate our list of target sheets
sheet_skip_amount = 5 # Number of sheets to skip between groups
end_of_first_sheetlist = 3 # Python lists start at 0
targetRange = [0,1,2,3]
sheetCount = 87 # One less than 88 because lists start at 0
sheetStart = 0
listRepetitions = sheetCount / len(targetRange)
target_sheetList = []
lastVal = 0

# Create our target list of sheet numbers
while lastVal+4 < sheetCount:
    tempList = [((x-sheetStart) + (sheetStart * sheet_skip_amount)) for x in targetRange]
    target_sheetList = target_sheetList + tempList
    lastVal = tempList[-1]
    sheetStart = sheetStart + 2

# Debug print statement
print(target_sheetList)

# Read in the sheet names from our Excel sheet
SheetNames = pd.ExcelFile("dataset.xlsx").sheet_names

# Only get the Excel sheet names that match our target sheet index numbers as calculated above
SheetNames_Sorted = [x for x in SheetNames if SheetNames.index(x) in target_sheetList]

# Read in our information from Excel - specifically remove the header values, and only use columns 4 and 6 (3 and 5)
sheets_dict = pd.read_excel("dataset.xlsx", na_values=['NAN'], sheet_name=SheetNames_Sorted, header=0, usecols=[3,5])

# Set some variables to hold the values from columns 4 and 6
column4Values = []
column6Values = []

#Iterate over each sheet in our Excel workbook
for name, sheet in sheets_dict.items():
    #select column with index position 4
    col4 = sheet.iloc[:, 0].tolist()
    column4Values = column4Values + col4
    #select column with index position 6
    col6 = sheet.iloc[:, 1].tolist()
    column6Values = column6Values + col6

#Finally, sort our data
column4Values.sort()
column6Values.sort()

#Print our output
print(column4Values, column6Values)

The general idea is to:

  1. Generate the list of target index items that you want (1,2,3,4,...,9,10,11,12,...,17,18,19,20....etc)
  2. From our target Excel sheets, get the values from column 4 and 6 (lists start at 0 in Python, so those are actually 3 and 5), and add those values to a list
  3. Sort those two lists

Upvotes: 0

Related Questions