Reputation: 155
I'm trying to write a Pandas script which to extract data from several Excel files. They contain between 10 and 15 columns. From these columns I need the 1st one which has different header in every file, and some other columns which always have the same header names ('TOTAL', 'CLEAR', 'NON-CLEAR'and 'SYSTEM') but they are positioned under different column index in the different files. (I mean that in one of the files 'TOTAL' is the 3rd column in the table but in another file it is the 5th column)
I know that using usecols
keyword I could specify which columns to use, but it looks like this argument takes only header names or only column indices, and never both of them in a combination.
Is it possible to write a statement which to take at the same time the 1st column by its index and then the other ones by header name?
The below statement doesn't work:
df = pd.read_excel(file, usecols = [0,'TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'])
Upvotes: 1
Views: 1214
Reputation: 5012
Here you go:
def callable():
first_column = None
def process(column_name):
nonlocal first_column
if first_column is None:
first_column = column_name
return True
if first_column == column_name:
return True
return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM']
return process
print(pd.read_csv(file, usecols=callable()))
Upvotes: 0
Reputation: 281
If it has only fifteen columns, it is probably faster not to read the file twice. You can read the whole file into memory and then extract the columns you need with the much nicer pandas interface:
df = pd.read_excel(file)
df = df[ [ df.columns[0], 'TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'] ]
Upvotes: 0
Reputation: 865
you could use pd.read_excel()
twice and than join both dfs
df1 = pd.read_excel(file, usecols = [0])
df2 = pd.read_excel(file, usecols = ['TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'])
df = pd.concat([df1, df2], axis = 1, join = 'outer')
Upvotes: 1