Reputation: 7
I have a big dataset with 88 Excel sheets in one excel file.
I want to sort the data from column 4 and 6 (not index). Every sheet have the same pattern of columns. The pattern of sheets I want to sort is the following: Sheet 1,2,3,4,...,9,10,11,12,...,17,18,19,20.... and so on (the last one's would be sheets 81,82,83 and 84).
The column "4" contains names and should be sorted alphabetical and the column 6 contains numbers and should be sorted in ascending order. The first row of every sheet contains the column name, this should not be sorted of course.
How can I do that with Python? I don't know how to deal with that problem after I load the data..
df = pd.read_excel("dataset.xlsx", na_values=['NAN'])
Upvotes: 0
Views: 92
Reputation: 185
You could try something like below:
# Import statement
import pandas as pd
# Generate our list of target sheets
sheet_skip_amount = 5 # Number of sheets to skip between groups
end_of_first_sheetlist = 3 # Python lists start at 0
targetRange = [0,1,2,3]
sheetCount = 87 # One less than 88 because lists start at 0
sheetStart = 0
listRepetitions = sheetCount / len(targetRange)
target_sheetList = []
lastVal = 0
# Create our target list of sheet numbers
while lastVal+4 < sheetCount:
tempList = [((x-sheetStart) + (sheetStart * sheet_skip_amount)) for x in targetRange]
target_sheetList = target_sheetList + tempList
lastVal = tempList[-1]
sheetStart = sheetStart + 2
# Debug print statement
print(target_sheetList)
# Read in the sheet names from our Excel sheet
SheetNames = pd.ExcelFile("dataset.xlsx").sheet_names
# Only get the Excel sheet names that match our target sheet index numbers as calculated above
SheetNames_Sorted = [x for x in SheetNames if SheetNames.index(x) in target_sheetList]
# Read in our information from Excel - specifically remove the header values, and only use columns 4 and 6 (3 and 5)
sheets_dict = pd.read_excel("dataset.xlsx", na_values=['NAN'], sheet_name=SheetNames_Sorted, header=0, usecols=[3,5])
# Set some variables to hold the values from columns 4 and 6
column4Values = []
column6Values = []
#Iterate over each sheet in our Excel workbook
for name, sheet in sheets_dict.items():
#select column with index position 4
col4 = sheet.iloc[:, 0].tolist()
column4Values = column4Values + col4
#select column with index position 6
col6 = sheet.iloc[:, 1].tolist()
column6Values = column6Values + col6
#Finally, sort our data
column4Values.sort()
column6Values.sort()
#Print our output
print(column4Values, column6Values)
The general idea is to:
Upvotes: 0